* [musl] Crash in kill(..., SIGHUP) when using SA_ONSTACK
From: Pablo Correa Gomez @ 2024-05-29 12:04 UTC
To: musl

Hi everybody,

I am responsible for musl CI in GNOME's GLib, and we have recently
bumped into a crash that I have been unable to resolve.

https://gitlab.gnome.org/GNOME/glib/-/commit/137db219a7266300ffde1aa75d781284fb0807cb
introduced in GLib an alternate stack by setting the signal action
SA_ONSTACK if available. However, the tests that were introduced, and
that pass in most other libc's (there's CI for a lot more than just
glibc and musl) crash in my alpine linux edge installation with SIGSEGV
(stack trace below) while doing: kill (getpid(), SIGHUP)

I have verified that not adding SA_ONSTACK fixes the crash. Would
anybody have some pointers of what could possibly be going wrong? If
anybody is really interested, the public issue is
https://gitlab.gnome.org/GNOME/glib/-/issues/3315

Stack trace
------------

Thread 1 "unix" received signal SIGSEGV, Segmentation fault.
0x00007ffff7fa96e8 in __syscall2 (a2=1, a1=17483, n=62) at ./arch/x86_64/syscall_arch.h:21
warning: 21 ./arch/x86_64/syscall_arch.h: No such file or directory
(gdb) bt
#0  0x00007ffff7fa96e8 in __syscall2 (a2=1, a1=17483, n=62) at ./arch/x86_64/syscall_arch.h:21
#1  kill (pid=17483, sig=sig@entry=1) at src/signal/kill.c:6
#2  0x0000555555556e96 in test_signal (signum=signum@entry=1) at ../glib/tests/unix.c:534
#3  0x0000555555557200 in test_signal_alternate_stack (signal=1) at ../glib/tests/unix.c:590
#4  0x00007ffff7e8f364 in test_case_run (path=<optimized out>, test_run_name=0x55555555d3f0 "/glib-unix/sighup/alternate-stack", tc=0x55555555db60) at ../glib/gtestutils.c:2988
#5  g_test_run_suite_internal (suite=suite@entry=0x55555555da70, path=path@entry=0x0) at ../glib/gtestutils.c:3090
#6  0x00007ffff7e8f2db in g_test_run_suite_internal (suite=suite@entry=0x7ffff7ffee20, path=path@entry=0x0) at ../glib/gtestutils.c:3109
#7  0x00007ffff7e8f2db in g_test_run_suite_internal (suite=suite@entry=0x7ffff7ffede0, path=path@entry=0x0) at ../glib/gtestutils.c:3109
#8  0x00007ffff7e8f86a in g_test_run_suite (suite=suite@entry=0x7ffff7ffede0) at ../glib/gtestutils.c:3189
#9  0x00007ffff7e8f8ea in g_test_run () at ../glib/gtestutils.c:2275
#10 0x00005555555561f7 in main (argc=<optimized out>, argv=<optimized out>) at ../glib/tests/unix.c:910

Best and thanks for your time,
Pablo.

* Re: [musl] Crash in kill(..., SIGHUP) when using SA_ONSTACK
From: Rich Felker @ 2024-05-29 13:15 UTC
To: Pablo Correa Gomez; +Cc: musl

On Wed, May 29, 2024 at 02:04:25PM +0200, Pablo Correa Gomez wrote:
> Hi everybody,
>
> I am responsible for musl CI in GNOME's GLib, and we have recently
> bumped into a crash that I have been unable to resolve.
>
> https://gitlab.gnome.org/GNOME/glib/-/commit/137db219a7266300ffde1aa75d781284fb0807cb
> introduced in GLib an alternate stack by setting the signal action
> SA_ONSTACK if available. However, the tests that were introduced, and
> that pass in most other libc's (there's CI for a lot more than just
> glibc and musl) crash in my alpine linux edge installation with SIGSEGV
> (stack trace below) while doing: kill (getpid(), SIGHUP)
>
> I have verified that not adding SA_ONSTACK fixes the crash. Would
> anybody have some pointers of what could possibly be going wrong? If
> anybody is really interested, the public issue is
> https://gitlab.gnome.org/GNOME/glib/-/issues/3315
>
> Stack trace
> ------------
>
> Thread 1 "unix" received signal SIGSEGV, Segmentation fault.
> 0x00007ffff7fa96e8 in __syscall2 (a2=1, a1=17483, n=62) at
> ./arch/x86_64/syscall_arch.h:21
> warning: 21 ./arch/x86_64/syscall_arch.h: No such file or directory
> (gdb) bt
> #0 0x00007ffff7fa96e8 in __syscall2 (a2=1, a1=17483, n=62) at
> ./arch/x86_64/syscall_arch.h:21
> #1 kill (pid=17483, sig=sig@entry=1) at src/signal/kill.c:6
> #2 0x0000555555556e96 in test_signal (signum=signum@entry=1) at
> ../glib/tests/unix.c:534
> #3 0x0000555555557200 in test_signal_alternate_stack (signal=1) at
> ../glib/tests/unix.c:590
> #4 0x00007ffff7e8f364 in test_case_run (path=<optimized out>,
> test_run_name=0x55555555d3f0 "/glib-unix/sighup/alternate-stack",
> tc=0x55555555db60) at ../glib/gtestutils.c:2988
> #5 g_test_run_suite_internal (suite=suite@entry=0x55555555da70,
> path=path@entry=0x0) at ../glib/gtestutils.c:3090
> #6 0x00007ffff7e8f2db in g_test_run_suite_internal
> (suite=suite@entry=0x7ffff7ffee20, path=path@entry=0x0) at
> ../glib/gtestutils.c:3109
> #7 0x00007ffff7e8f2db in g_test_run_suite_internal
> (suite=suite@entry=0x7ffff7ffede0, path=path@entry=0x0) at
> ../glib/gtestutils.c:3109
> #8 0x00007ffff7e8f86a in g_test_run_suite
> (suite=suite@entry=0x7ffff7ffede0) at ../glib/gtestutils.c:3189
> #9 0x00007ffff7e8f8ea in g_test_run () at ../glib/gtestutils.c:2275
> #10 0x00005555555561f7 in main (argc=<optimized out>, argv=<optimized
> out>) at ../glib/tests/unix.c:910

Can you get a disassembly and register dump at the point of crash?

My best guess is that this is a simple stack overflow. There's really
not any other plausible reason for a segfault in kill(). The only
operations that touch memory in it are (on my build at least) a push
to realign the stack, and a call to __syscall_ret.

I'm not sure if the crashing code is running on the signal stack or
main stack, but here's a thought: is it possible the CI machines are
running on a cpu/kernel with some monster AVX512 or whatever extension
enabled with register file that doesn't fit in MINSIGSTKSZ?
If so, using sysconf(_SC_MINSIGSTKSZ) (conditional on _SC_MINSIGSTKSZ
being defined) to allocate the alt stack should mitigate the problem.
If doing this, it should probably be allocated by mmap or malloc, since
in principle it could be too large for the caller's stack.

It's also possible that the kernel may have some weird behavior
deciding if the task is already "running on the alt stack" when the
alt stack is embedded in the normal stack like this. Just getting rid
of that might be worth trying. If so, whether the problem manifests
could be subject to timing of signal delivery (although I would not
expect that for synchronously generated signals like here).

Rich
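
For illustration, the suggestion above (size the alt stack from
sysconf(_SC_MINSIGSTKSZ) when that name is defined, and take the memory
from mmap rather than the caller's stack) could look roughly like the
sketch below. This is not the GLib change; altstack_size() and
install_altstack() are hypothetical helper names, and the factor of 4 is
an arbitrary headroom choice made only for this example.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: choose an alt stack size. Starts from SIGSTKSZ and,
 * if the runtime minimum is available, grows to 4x that minimum (the 4x
 * headroom is an assumption of this sketch, not a documented rule). */
static size_t altstack_size(void)
{
    size_t size = (size_t)SIGSTKSZ;
#ifdef _SC_MINSIGSTKSZ
    long min = sysconf(_SC_MINSIGSTKSZ);
    if (min > 0 && (size_t)min * 4 > size)
        size = (size_t)min * 4;
#endif
    return size;
}

/* Hypothetical helper: map an alt stack outside the normal stack and
 * register it with sigaltstack(). Returns 0 on success, -1 on failure.
 * On success *out describes the installed stack. */
static int install_altstack(stack_t *out)
{
    stack_t ss;

    ss.ss_size = altstack_size();
    ss.ss_flags = 0;
    ss.ss_sp = mmap(NULL, ss.ss_size, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (ss.ss_sp == MAP_FAILED)
        return -1;

    if (sigaltstack(&ss, NULL) != 0) {
        munmap(ss.ss_sp, ss.ss_size);
        return -1;
    }

    *out = ss;
    return 0;
}
```

A caller would run install_altstack() once before installing any handler
with SA_ONSTACK, and keep the returned stack_t around if it ever needs to
munmap the region again.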

* Re: [musl] Crash in kill(..., SIGHUP) when using SA_ONSTACK
From: Pablo Correa Gomez @ 2024-05-30 10:17 UTC
To: Rich Felker; +Cc: musl

Hi Rich, thanks a lot for your reply

On Wed, 29-05-2024 at 09:15 -0400, Rich Felker wrote:
> On Wed, May 29, 2024 at 02:04:25PM +0200, Pablo Correa Gomez wrote:
> > Hi everybody,
> >
> > I am responsible for musl CI in GNOME's GLib, and we have recently
> > bumped into a crash that I have been unable to resolve.
> >
> > https://gitlab.gnome.org/GNOME/glib/-/commit/137db219a7266300ffde1aa75d781284fb0807cb
> > introduced in GLib an alternate stack by setting the signal action
> > SA_ONSTACK if available. However, the tests that were introduced, and
> > that pass in most other libc's (there's CI for a lot more than just
> > glibc and musl) crash in my alpine linux edge installation with SIGSEGV
> > (stack trace below) while doing: kill (getpid(), SIGHUP)
> >
> > I have verified that not adding SA_ONSTACK fixes the crash. Would
> > anybody have some pointers of what could possibly be going wrong? If
> > anybody is really interested, the public issue is
> > https://gitlab.gnome.org/GNOME/glib/-/issues/3315
> >
> > Stack trace
> > ------------
> >
> > Thread 1 "unix" received signal SIGSEGV, Segmentation fault.
> > 0x00007ffff7fa96e8 in __syscall2 (a2=1, a1=17483, n=62) at
> > ./arch/x86_64/syscall_arch.h:21
> > warning: 21 ./arch/x86_64/syscall_arch.h: No such file or directory
> > (gdb) bt
> > #0 0x00007ffff7fa96e8 in __syscall2 (a2=1, a1=17483, n=62) at
> > ./arch/x86_64/syscall_arch.h:21
> > #1 kill (pid=17483, sig=sig@entry=1) at src/signal/kill.c:6
> > #2 0x0000555555556e96 in test_signal (signum=signum@entry=1) at
> > ../glib/tests/unix.c:534
> > #3 0x0000555555557200 in test_signal_alternate_stack (signal=1) at
> > ../glib/tests/unix.c:590
> > #4 0x00007ffff7e8f364 in test_case_run (path=<optimized out>,
> > test_run_name=0x55555555d3f0 "/glib-unix/sighup/alternate-stack",
> > tc=0x55555555db60) at ../glib/gtestutils.c:2988
> > #5 g_test_run_suite_internal (suite=suite@entry=0x55555555da70,
> > path=path@entry=0x0) at ../glib/gtestutils.c:3090
> > #6 0x00007ffff7e8f2db in g_test_run_suite_internal
> > (suite=suite@entry=0x7ffff7ffee20, path=path@entry=0x0) at
> > ../glib/gtestutils.c:3109
> > #7 0x00007ffff7e8f2db in g_test_run_suite_internal
> > (suite=suite@entry=0x7ffff7ffede0, path=path@entry=0x0) at
> > ../glib/gtestutils.c:3109
> > #8 0x00007ffff7e8f86a in g_test_run_suite
> > (suite=suite@entry=0x7ffff7ffede0) at ../glib/gtestutils.c:3189
> > #9 0x00007ffff7e8f8ea in g_test_run () at ../glib/gtestutils.c:2275
> > #10 0x00005555555561f7 in main (argc=<optimized out>, argv=<optimized
> > out>) at ../glib/tests/unix.c:910
>
> Can you get a disassembly and register dump at the point of crash?

(gdb) layout asm
   0x7ffff7fa96f9 <kill+7>        movslq %esi,%rsi
   0x7ffff7fa96fc <kill+10>       mov    $0x3e,%eax
   0x7ffff7fa9701 <kill+15>       syscall
  >0x7ffff7fa9703 <kill+17>       mov    %rax,%rdi
   0x7ffff7fa9706 <kill+20>       call   0x7ffff7f7afb7 <__syscall_ret>
   0x7ffff7fa970b <kill+25>       add    $0x8,%rsp
   0x7ffff7fa970f <kill+29>       ret
   0x7ffff7fa9710 <killpg>        test   %edi,%edi
   0x7ffff7fa9712 <killpg+2>      js     0x7ffff7fa971b <killpg+11>
   0x7ffff7fa9714 <killpg+4>      neg    %edi
   0x7ffff7fa9716 <killpg+6>      jmp    0x7ffff7fa96f2 <kill>
   0x7ffff7fa971b <killpg+11>     sub    $0x8,%rsp
   0x7ffff7fa971f <killpg+15>     call   0x7ffff7f78bae <__errno_location>
   0x7ffff7fa9724 <killpg+20>     movl   $0x16,(%rax)
   0x7ffff7fa972a <killpg+26>     mov    $0xffffffff,%eax
   0x7ffff7fa972f <killpg+31>     add    $0x8,%rsp
   0x7ffff7fa9733 <killpg+35>     ret
   0x7ffff7fa9734 <psiginfo>      mov    (%rdi),%edi
   0x7ffff7fa9736 <psiginfo+2>    jmp    0x7ffff7fa973b <psignal>
   0x7ffff7fa973b <psignal>       push   %r15
   0x7ffff7fa973d <psignal+2>     push   %r14
   0x7ffff7fa973f <psignal+4>     push   %r13
   0x7ffff7fa9741 <psignal+6>     lea    0x51938(%rip),%r13    # 0x7ffff7ffb080 <__stderr_FILE>
   0x7ffff7fa9748 <psignal+13>    push   %r12
   0x7ffff7fa974a <psignal+15>    xor    %r12d,%r12d
   0x7ffff7fa974d <psignal+18>    push   %rbp
   0x7ffff7fa974e <psignal+19>    push   %rbx
   0x7ffff7fa974f <psignal+20>    mov    %rsi,%rbx
   0x7ffff7fa9752 <psignal+23>    sub    $0x18,%rsp
   0x7ffff7fa9756 <psignal+27>    call   0x7ffff7fb5780 <strsignal>

(gdb) info registers
rax            0x0                 0
rbx            0x7ffff7f55c30      140737353440304
rcx            0x7ffff7fa9703      140737353783043
rdx            0x0                 0
rsi            0x1                 1
rdi            0x525e              21086
rbp            0x1                 0x1
rsp            0x7fffffffd5d0      0x7fffffffd5d0
r8             0x0                 0
r9             0x80                128
r10            0x8                 8
r11            0x202               514
r12            0x7ffff7ffdb5c      140737354128220
r13            0x1                 1
r14            0x7fffffffd6d0      140737488344784
r15            0x7fffffffd6f0      140737488344816
rip            0x7ffff7fa9703      0x7ffff7fa9703 <kill+17>
eflags         0x202               [ IF ]
cs             0x33                51
ss             0x2b                43
ds             0x0                 0
es             0x0                 0
fs             0x0                 0
gs             0x0                 0
fs_base        0x7ffff7ffdb28      140737354128168
gs_base        0x0                 0

Does this tell you anything?

> I'm not sure if the crashing code is running on the signal stack or
> main stack, but here's a thought: is it possible the CI machines are
> running on a cpu/kernel with some monster AVX512 or whatever extension
> enabled with register file that doesn't fit in MINSIGSTKSZ?

That might be the case. Would explain why I could not reproduce in my
9-year old laptop I was running last month, but I can reproduce it now
in a new machine with a 13th Gen Intel(R) Core(TM) i7-1360P

> If so,
> using sysconf(_SC_MINSIGSTKSZ) (conditional on _SC_MINSIGSTKSZ being
> defined) to allocate the alt stack should mitigate the problem. If
> doing this, it should probably be allocated by mmap or malloc, since
> in principle it could be too large for the caller's stack.
>

I'll forward this to the maintainers, let's see if we can come up with
a solution. Thanks a lot for your feedback!

> It's also possible that the kernel may have some weird behavior
> deciding if the task is already "running on the alt stack" when the
> alt stack is embedded in the normal stack like this. Just getting rid
> of that might be worth trying. If so, whether the problem manifests
> could be subject to timing of signal delivery (although I would not
> expect that for synchronously generated signals like here).
>
> Rich

* Re: [musl] Crash in kill(..., SIGHUP) when using SA_ONSTACK
From: Markus Wichmann @ 2024-05-30 11:51 UTC
To: musl; +Cc: Pablo Correa Gomez

On Thu, May 30, 2024 at 12:17:59PM +0200, Pablo Correa Gomez wrote:
> On Wed, 29-05-2024 at 09:15 -0400, Rich Felker wrote:
> > On Wed, May 29, 2024 at 02:04:25PM +0200, Pablo Correa Gomez wrote:
> > > Thread 1 "unix" received signal SIGSEGV, Segmentation fault.
> > > 0x00007ffff7fa96e8 in __syscall2 (a2=1, a1=17483, n=62) at
> > > ./arch/x86_64/syscall_arch.h:21
> (gdb) layout asm
>
> 0x7ffff7fa96f9 <kill+7> movslq %esi,%rsi
> 0x7ffff7fa96fc <kill+10> mov $0x3e,%eax
> 0x7ffff7fa9701 <kill+15> syscall
> >0x7ffff7fa9703 <kill+17> mov %rax,%rdi
> [...]
> Does this tell you anything?
>

It tells me that Rich's reasoning was correct. I'll explain further
down.

> > I'm not sure if the crashing code is running on the signal stack or
> > main stack, but here's a thought: is it possible the CI machines are
> > running on a cpu/kernel with some monster AVX512 or whatever extension
> > enabled with register file that doesn't fit in MINSIGSTKSZ?
>
> That might be the case. Would explain why I could not reproduce in my
> 9-year old laptop I was running last month, but I can reproduce it now
> in a new machine with a 13th Gen Intel(R) Core(TM) i7-1360P
>

That is exactly what the program is doing, according to the link you
provided in the OP.

> > It's also possible that the kernel may have some weird behavior
> > deciding if the task is already "running on the alt stack" when the
> > alt stack is embedded in the normal stack like this. Just getting rid
> > of that might be worth trying. If so, whether the problem manifests
> > could be subject to timing of signal delivery (although I would not
> > expect that for synchronously generated signals like here).
> >

Thankfully, we needn't speculate, as Linux is open source. The function
get_sigframe() will determine if the thread is currently executing on
the signal stack. It does that by determining that the sp is between
stack base and stack top. If that isn't the case, it will allocate a red
zone, else it will start at the top of the altstack.

It will then try to allocate a full frame. If that doesn't work (because
it already was on an altstack that got overflowed, or it tried to enter
too small of an altstack), then it will generate a message "overflowed
sigaltstack", that you might find in dmesg, before returning a bogus
address.

Due to the bogus address, all calls to unsafe_put_user() in
x64_setup_rt_frame() will fail, and it will return EFAULT. This error
will be reported to signal_setup_done() and it will call
force_sigsegv(), which then reports a SIGSEGV at the "current" IP. Since
this all happens during a syscall, the current IP is the one directly
following the syscall instruction.

Ciao,
Markus
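
The "frame does not fit, so a bogus address comes back" step described
above can be pictured with a small userspace model. To be clear, this is
not kernel code: on_altstack() and model_sigframe() are hypothetical
names, the red zone and FPU frame details are left out, and the handler
is assumed to have been installed with SA_ONSTACK. It only mirrors the
sequence in the explanation: decide whether sp is already inside the
altstack, otherwise start at its top, carve out a frame, and give up if
the result is no longer inside the altstack.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Model only: true when sp lies inside the altstack (base, base + size]. */
static bool on_altstack(uintptr_t sp, uintptr_t base, size_t size)
{
    return sp > base && sp - base <= size;
}

/* Model only: returns where a signal frame of frame_size bytes would go,
 * or UINTPTR_MAX (standing in for the "bogus address") when it does not
 * fit inside the altstack. */
static uintptr_t model_sigframe(uintptr_t sp, uintptr_t base, size_t size,
                                size_t frame_size)
{
    /* Not yet on the altstack: switch to its top. Stacks grow down. */
    if (!on_altstack(sp, base, size))
        sp = base + size;

    /* Avoid wrapping around in the model itself. */
    if (frame_size > sp)
        return UINTPTR_MAX;
    sp -= frame_size;

    /* The frame ran off the bottom: this is the overflow case in which the
     * real kernel would end up forcing SIGSEGV. */
    if (!on_altstack(sp, base, size))
        return UINTPTR_MAX;

    return sp;
}
```

With a 2048-byte altstack, a 1000-byte frame fits and a 3000-byte frame
does not, which is the shape of the failure when the CPU's saved register
state outgrows an altstack sized from the compile-time MINSIGSTKSZ.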