[musl] Crash in kill(..., SIGHUP) when using SA_ONSTACK

mailing list of musl libc

* [musl] Crash in kill(..., SIGHUP) when using SA_ONSTACK
From: Pablo Correa Gomez @ 2024-05-29 12:04 UTC
  To: musl

Hi everybody,

I am responsible for musl CI in GNOME's GLib, and we have recently
bumped into a crash that I have been unable to resolve. 

https://gitlab.gnome.org/GNOME/glib/-/commit/137db219a7266300ffde1aa75d781284fb0807cb
introduced an alternate signal stack in GLib by setting the SA_ONSTACK
flag on the signal action when it is available. However, the tests added
with it, which pass with most other libcs (there's CI for a lot more
than just glibc and musl), crash on my Alpine Linux edge installation
with SIGSEGV (stack trace below) while doing: kill (getpid(), SIGHUP)

I have verified that not adding SA_ONSTACK fixes the crash. Would
anybody have some pointers on what could be going wrong? If anybody is
really interested, the public issue is
https://gitlab.gnome.org/GNOME/glib/-/issues/3315

Stack trace
------------

Thread 1 "unix" received signal SIGSEGV, Segmentation fault.
0x00007ffff7fa96e8 in __syscall2 (a2=1, a1=17483, n=62) at ./arch/x86_64/syscall_arch.h:21
warning: 21     ./arch/x86_64/syscall_arch.h: No such file or directory
(gdb) bt
#0  0x00007ffff7fa96e8 in __syscall2 (a2=1, a1=17483, n=62) at ./arch/x86_64/syscall_arch.h:21
#1  kill (pid=17483, sig=sig@entry=1) at src/signal/kill.c:6
#2  0x0000555555556e96 in test_signal (signum=signum@entry=1) at ../glib/tests/unix.c:534
#3  0x0000555555557200 in test_signal_alternate_stack (signal=1) at ../glib/tests/unix.c:590
#4  0x00007ffff7e8f364 in test_case_run (path=<optimized out>, test_run_name=0x55555555d3f0 "/glib-unix/sighup/alternate-stack", tc=0x55555555db60) at ../glib/gtestutils.c:2988
#5  g_test_run_suite_internal (suite=suite@entry=0x55555555da70, path=path@entry=0x0) at ../glib/gtestutils.c:3090
#6  0x00007ffff7e8f2db in g_test_run_suite_internal (suite=suite@entry=0x7ffff7ffee20, path=path@entry=0x0) at ../glib/gtestutils.c:3109
#7  0x00007ffff7e8f2db in g_test_run_suite_internal (suite=suite@entry=0x7ffff7ffede0, path=path@entry=0x0) at ../glib/gtestutils.c:3109
#8  0x00007ffff7e8f86a in g_test_run_suite (suite=suite@entry=0x7ffff7ffede0) at ../glib/gtestutils.c:3189
#9  0x00007ffff7e8f8ea in g_test_run () at ../glib/gtestutils.c:2275
#10 0x00005555555561f7 in main (argc=<optimized out>, argv=<optimized out>) at ../glib/tests/unix.c:910

Best and thanks for your time,
Pablo.


* Re: [musl] Crash in kill(..., SIGHUP) when using SA_ONSTACK
From: Rich Felker @ 2024-05-29 13:15 UTC
  To: Pablo Correa Gomez; +Cc: musl

On Wed, May 29, 2024 at 02:04:25PM +0200, Pablo Correa Gomez wrote:
> [...]

Can you get a disassembly and register dump at the point of crash? My
best guess is that this is a simple stack overflow. There's really not
any other plausible reason for a segfault in kill(). The only
operations that touch memory in it are (on my build at least) a push
to realign the stack, and a call to __syscall_ret.

I'm not sure if the crashing code is running on the signal stack or
main stack, but here's a thought: is it possible the CI machines are
running on a CPU/kernel with some monster AVX512 or whatever extension
enabled, with a register file that doesn't fit in MINSIGSTKSZ? If so,
using sysconf(_SC_MINSIGSTKSZ) (conditional on _SC_MINSIGSTKSZ being
defined) to size the alt stack should mitigate the problem. If doing
this, the stack should probably be allocated by mmap or malloc, since
in principle it could be too large for the caller's stack.

It's also possible that the kernel may have some weird behavior
deciding if the task is already "running on the alt stack" when the
alt stack is embedded in the normal stack like this. Just getting rid
of that might be worth trying. If so, whether the problem manifests
could be subject to timing of signal delivery (although I would not
expect that for synchronously generated signals like here).

Rich


* Re: [musl] Crash in kill(..., SIGHUP) when using SA_ONSTACK
From: Pablo Correa Gomez @ 2024-05-30 10:17 UTC
  To: Rich Felker; +Cc: musl

Hi Rich, thanks a lot for your reply.

On Wed, 29-05-2024 at 09:15 -0400, Rich Felker wrote:
> On Wed, May 29, 2024 at 02:04:25PM +0200, Pablo Correa Gomez wrote:
> > [...]
> 
> Can you get a disassembly and register dump at the point of crash?

(gdb) layout asm

 0x7ffff7fa96f9 <kill+7>     movslq %esi,%rsi                         
 0x7ffff7fa96fc <kill+10>    mov    $0x3e,%eax                        
 0x7ffff7fa9701 <kill+15>    syscall                                  
>0x7ffff7fa9703 <kill+17>    mov    %rax,%rdi                         
 0x7ffff7fa9706 <kill+20>    call  0x7ffff7f7afb7 <__syscall_ret>     
 0x7ffff7fa970b <kill+25>    add    $0x8,%rsp                         
 0x7ffff7fa970f <kill+29>    ret                                      
 0x7ffff7fa9710 <killpg>     test   %edi,%edi                         
 0x7ffff7fa9712 <killpg+2>   js     0x7ffff7fa971b <killpg+11>        
 0x7ffff7fa9714 <killpg+4>   neg    %edi                              
 0x7ffff7fa9716 <killpg+6>   jmp    0x7ffff7fa96f2 <kill>             
 0x7ffff7fa971b <killpg+11>  sub    $0x8,%rsp                         
 0x7ffff7fa971f <killpg+15>  call   0x7ffff7f78bae <__errno_location> 
 0x7ffff7fa9724 <killpg+20>  movl   $0x16,(%rax)                      
 0x7ffff7fa972a <killpg+26>  mov    $0xffffffff,%eax                  
 0x7ffff7fa972f <killpg+31>  add    $0x8,%rsp                         
 0x7ffff7fa9733 <killpg+35>  ret                                      
 0x7ffff7fa9734 <psiginfo>   mov    (%rdi),%edi                       
 0x7ffff7fa9736 <psiginfo+2> jmp    0x7ffff7fa973b <psignal>          
 0x7ffff7fa973b <psignal>    push   %r15    
 0x7ffff7fa973d <psignal+2>  push   %r14                              
 0x7ffff7fa973f <psignal+4>  push   %r13                              
 0x7ffff7fa9741 <psignal+6>  lea    0x51938(%rip),%r13        # 0x7ffff7ffb080 <__stderr_FILE>
 0x7ffff7fa9748 <psignal+13> push   %r12     
 0x7ffff7fa974a <psignal+15> xor    %r12d,%r12d                       
 0x7ffff7fa974d <psignal+18> push   %rbp                              
 0x7ffff7fa974e <psignal+19> push   %rbx                              
 0x7ffff7fa974f <psignal+20> mov    %rsi,%rbx                         
 0x7ffff7fa9752 <psignal+23> sub    $0x18,%rsp                        
 0x7ffff7fa9756 <psignal+27> call   0x7ffff7fb5780 <strsignal>   

(gdb) info registers
rax            0x0                 0
rbx            0x7ffff7f55c30      140737353440304
rcx            0x7ffff7fa9703      140737353783043
rdx            0x0                 0
rsi            0x1                 1
rdi            0x525e              21086
rbp            0x1                 0x1
rsp            0x7fffffffd5d0      0x7fffffffd5d0
r8             0x0                 0
r9             0x80                128
r10            0x8                 8
r11            0x202               514
r12            0x7ffff7ffdb5c      140737354128220
r13            0x1                 1
r14            0x7fffffffd6d0      140737488344784
r15            0x7fffffffd6f0      140737488344816
rip            0x7ffff7fa9703      0x7ffff7fa9703 <kill+17>
eflags         0x202               [ IF ]
cs             0x33                51
ss             0x2b                43
ds             0x0                 0
es             0x0                 0
fs             0x0                 0
gs             0x0                 0
fs_base        0x7ffff7ffdb28      140737354128168
gs_base        0x0                 0

Does this tell you anything?
       
> I'm not sure if the crashing code is running on the signal stack or
> main stack, but here's a thought: is it possible the CI machines are
> running on a cpu/kernel with some monster AVX512 or whatever
> extension
> enabled with register file that doesn't fit in MINSIGSTKSZ?

That might be the case. It would explain why I could not reproduce it on
the 9-year-old laptop I was using last month, but can reproduce it now
on a new machine with a 13th Gen Intel(R) Core(TM) i7-1360P.

>  If so,
> using sysconf(_SC_MINSIGSTKSZ) (conditional on _SC_MINSIGSTKSZ being
> defined) to allocate the alt stack should mitigate the problem. If
> doing this, it should probably be allocated by mmap or malloc, since
> in principle it could be too large for the caller's stack.
> 

I'll forward this to the maintainers, let's see if we can come up with
a solution. Thanks a lot for your feedback!

> It's also possible that the kernel may have some weird behavior
> deciding if the task is already "running on the alt stack" when the
> alt stack is embedded in the normal stack like this. Just getting rid
> of that might be worth trying. If so, whether the problem manifests
> could be subject to timing of signal delivery (although I would not
> expect that for synchronously generated signals like here).
> 
> Rich



* Re: [musl] Crash in kill(..., SIGHUP) when using SA_ONSTACK
From: Markus Wichmann @ 2024-05-30 11:51 UTC
  To: musl; +Cc: Pablo Correa Gomez

On Thu, May 30, 2024 at 12:17:59PM +0200, Pablo Correa Gomez wrote:
> On Wed, 29-05-2024 at 09:15 -0400, Rich Felker wrote:
> > On Wed, May 29, 2024 at 02:04:25PM +0200, Pablo Correa Gomez wrote:
> > > Thread 1 "unix" received signal SIGSEGV, Segmentation fault.
> > > 0x00007ffff7fa96e8 in __syscall2 (a2=1, a1=17483, n=62) at
> > > ../arch/x86_64/syscall_arch.h:21
> (gdb) layout asm
>
>  0x7ffff7fa96f9 <kill+7>     movslq %esi,%rsi
>  0x7ffff7fa96fc <kill+10>    mov    $0x3e,%eax
>  0x7ffff7fa9701 <kill+15>    syscall
> >0x7ffff7fa9703 <kill+17>    mov    %rax,%rdi
> [...]
> Does this tell you anything?
>

It tells me that Rich's reasoning was correct. I'll explain further
down.

> > I'm not sure if the crashing code is running on the signal stack or
> > main stack, but here's a thought: is it possible the CI machines are
> > running on a cpu/kernel with some monster AVX512 or whatever
> > extension
> > enabled with register file that doesn't fit in MINSIGSTKSZ?
>
> That might be the case. Would explain why I could not reproduce in my
> 9-year old laptop I was running last month, but I can reproduce it now
> in a new machine with a 13th Gen Intel(R) Core(TM) i7-1360P
>

That is exactly what the program is doing, according to the link you
provided in the OP.

> > It's also possible that the kernel may have some weird behavior
> > deciding if the task is already "running on the alt stack" when the
> > alt stack is embedded in the normal stack like this. Just getting rid
> > of that might be worth trying. If so, whether the problem manifests
> > could be subject to timing of signal delivery (although I would not
> > expect that for synchronously generated signals like here).
> >

Thankfully, we needn't speculate, as Linux is open source. The function
get_sigframe() determines whether the thread is currently executing on
the signal stack by checking whether the sp lies between the stack base
and the stack top. If that isn't the case, it will allocate a red zone,
else it will start at the top of the altstack. It will then try to
allocate a full frame. If that doesn't work (because it was already on
an altstack that overflowed, or because it tried to enter an altstack
that is too small), it logs the message "overflowed sigaltstack", which
you might find in dmesg, and returns a bogus address.

Due to the bogus address, all calls to unsafe_put_user() in
x64_setup_rt_frame() will fail, and it will return EFAULT. This error
will be reported to signal_setup_done() and it will call
force_sigsegv(), which then reports a SIGSEGV at the "current" IP. Since
this all happens during a syscall, the current IP is the one directly
following the syscall instruction.

> > Rich
>

Ciao,
Markus
