mailing list of musl libc
 help / color / mirror / code / Atom feed
* [musl] [Bug] Do not ignore membarrier return code
@ 2020-03-23 16:10 Julio Guerra
  2020-03-23 16:38 ` Rich Felker
  0 siblings, 1 reply; 8+ messages in thread
From: Julio Guerra @ 2020-03-23 16:10 UTC (permalink / raw)
  To: musl; +Cc: Vladimir de Turckheim

[-- Attachment #1: Type: text/plain, Size: 698 bytes --]

Hello,

The implementation of dlopen() uses membarrier() (
https://git.musl-libc.org/cgit/musl/tree/ldso/dynlink.c#n1579) while
currently forbidden by the default docker seccomp profile.

I perfectly understand that it's on docker's end and I suggested them to
add it in this PR <https://github.com/moby/moby/pull/40731> but such a
critical syscall shouldn't be silently ignored. And it for example leads to
random segfaults on nodejs. I also saw opened qemu issues related to
membarrier + alpine.

dlopen() should therefore fail when membarrier fails (ie. in this case
when __membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED,
0) != 0).

Happy to provide the patch if agreed!
All the best,
Julio Guerra

[-- Attachment #2: Type: text/html, Size: 1764 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [musl] [Bug] Do not ignore membarrier return code
  2020-03-23 16:10 [musl] [Bug] Do not ignore membarrier return code Julio Guerra
@ 2020-03-23 16:38 ` Rich Felker
  2020-03-24 13:20   ` Julio Guerra
  0 siblings, 1 reply; 8+ messages in thread
From: Rich Felker @ 2020-03-23 16:38 UTC (permalink / raw)
  To: musl

On Mon, Mar 23, 2020 at 05:10:40PM +0100, Julio Guerra wrote:
> Hello,
> 
> The implementation of dlopen() uses membarrier() (
> https://git.musl-libc.org/cgit/musl/tree/ldso/dynlink.c#n1579) while
> currently forbidden by the default docker seccomp profile.
> 
> I perfectly understand that it's on docker's end and I suggested them to
> add it in this PR <https://github.com/moby/moby/pull/40731> but such a
> critical syscall shouldn't be silently ignored. And it for example leads to
> random segfaults on nodejs. I also saw opened qemu issues related to
> membarrier + alpine.
> 
> dlopen() should therefore fail when membarrier fails (ie. in this case
> when __membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED,
> 0) != 0).

At that point it's past the point where failure is possible; making it
possible would be rather nontrivial. But you missed that it can't
fail. musl has a very heavy fallback implemementation for the case
where it's not implemented or somehow fails; see
src/linux/membarrier.c.

However, the reason you're seeing the failure is something of a bug in
musl -- registration of intent to use membarrier is only done on first
pthread_create. That's okay because it's only needed at all if the
process is multithreaded. However, dlopen is calling it
unconditionally even if the process is not multithreaded, and thereby
getting a spurious failure since it wasn't registered yet. It should
just be fixed not to make the fall if it's not multithreaded.

Rich

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [musl] [Bug] Do not ignore membarrier return code
  2020-03-23 16:38 ` Rich Felker
@ 2020-03-24 13:20   ` Julio Guerra
  2020-03-24 13:53     ` Rich Felker
  2020-03-24 13:53     ` Julio Guerra
  0 siblings, 2 replies; 8+ messages in thread
From: Julio Guerra @ 2020-03-24 13:20 UTC (permalink / raw)
  To: musl

Hello Rich,

Here are more details on what we did to reproduce the issue.
You can clone this gist
https://gist.github.com/vdeturckheim/d420310e272f525824d7e92e7e875024
and have a look at the run.sh file example in order to get started
with it. The test.js file does a require of the js bindings of grpc,
which involves the dlopen.

What we observed yesterday with this example was:
- It crashed approximately 9 times out of 10 on aws codebuild with the
machine BUILD_GENERAL1_SMALL (3 GB memory, 2 vCPUs).
- It worked all the time by only adding membarrier to the seccomp
profile of the docker run.

But I wanted to give you more details with stack traces of the
segfault by retrying today with gdb but I cannot reproduce it
anymore...!
I'll retry later to see if I see the error again...

If what you say about membarrier is true, I think there may be some
synchronization side-effect of the syscall since, afaik, node uses
threads in order to load the shared libraries in the libuv.

-- 
Julio

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [musl] [Bug] Do not ignore membarrier return code
  2020-03-24 13:20   ` Julio Guerra
@ 2020-03-24 13:53     ` Rich Felker
  2020-03-24 13:53     ` Julio Guerra
  1 sibling, 0 replies; 8+ messages in thread
From: Rich Felker @ 2020-03-24 13:53 UTC (permalink / raw)
  To: musl

On Tue, Mar 24, 2020 at 02:20:08PM +0100, Julio Guerra wrote:
> Hello Rich,
> 
> Here are more details on what we did to reproduce the issue.
> You can clone this gist
> https://gist.github.com/vdeturckheim/d420310e272f525824d7e92e7e875024
> and have a look at the run.sh file example in order to get started
> with it. The test.js file does a require of the js bindings of grpc,
> which involves the dlopen.
> 
> What we observed yesterday with this example was:
> - It crashed approximately 9 times out of 10 on aws codebuild with the
> machine BUILD_GENERAL1_SMALL (3 GB memory, 2 vCPUs).
> - It worked all the time by only adding membarrier to the seccomp
> profile of the docker run.
> 
> But I wanted to give you more details with stack traces of the
> segfault by retrying today with gdb but I cannot reproduce it
> anymore...!
> I'll retry later to see if I see the error again...
> 
> If what you say about membarrier is true, I think there may be some
> synchronization side-effect of the syscall since, afaik, node uses
> threads in order to load the shared libraries in the libuv.

My best guess, especially since the crash was unpredictable, is that
the stack size on at least one thread is barely sufficient for what
it's doing. The fallback path when the membarrier syscall fails
requires the ability to deliver signals, and if any thread has
insufficient space left on its stack to accept a signal, it will
crash.

Rich

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [musl] [Bug] Do not ignore membarrier return code
  2020-03-24 13:20   ` Julio Guerra
  2020-03-24 13:53     ` Rich Felker
@ 2020-03-24 13:53     ` Julio Guerra
  2020-03-24 14:01       ` Rich Felker
  1 sibling, 1 reply; 8+ messages in thread
From: Julio Guerra @ 2020-03-24 13:53 UTC (permalink / raw)
  To: musl

> But I wanted to give you more details with stack traces of the
> segfault by retrying today with gdb but I cannot reproduce it
> anymore...!
> I'll retry later to see if I see the error again...

The node:10.19-alpine image has been updated since yesterday! That may
be why I don't have the issue anymore? It updated the latest
alpine:3.11 image, so 3.11.5.
The changelog talks says "main/musl: backport fixes from 1.2.0":
https://www.alpinelinux.org/posts/Alpine-3.11.5-released.html

-- 
Julio

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [musl] [Bug] Do not ignore membarrier return code
  2020-03-24 13:53     ` Julio Guerra
@ 2020-03-24 14:01       ` Rich Felker
  2020-03-24 16:08         ` A. Wilcox
  0 siblings, 1 reply; 8+ messages in thread
From: Rich Felker @ 2020-03-24 14:01 UTC (permalink / raw)
  To: musl

On Tue, Mar 24, 2020 at 02:53:21PM +0100, Julio Guerra wrote:
> > But I wanted to give you more details with stack traces of the
> > segfault by retrying today with gdb but I cannot reproduce it
> > anymore...!
> > I'll retry later to see if I see the error again...
> 
> The node:10.19-alpine image has been updated since yesterday! That may
> be why I don't have the issue anymore? It updated the latest
> alpine:3.11 image, so 3.11.5.
> The changelog talks says "main/musl: backport fixes from 1.2.0":
> https://www.alpinelinux.org/posts/Alpine-3.11.5-released.html

I'm not aware of any changes that are at all related to your problem.

Rich

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [musl] [Bug] Do not ignore membarrier return code
  2020-03-24 14:01       ` Rich Felker
@ 2020-03-24 16:08         ` A. Wilcox
  2020-03-24 17:42           ` Rich Felker
  0 siblings, 1 reply; 8+ messages in thread
From: A. Wilcox @ 2020-03-24 16:08 UTC (permalink / raw)
  To: musl


[-- Attachment #1.1: Type: text/plain, Size: 1247 bytes --]

On 24/03/2020 09:01, Rich Felker wrote:
> On Tue, Mar 24, 2020 at 02:53:21PM +0100, Julio Guerra wrote:
>>> But I wanted to give you more details with stack traces of the
>>> segfault by retrying today with gdb but I cannot reproduce it
>>> anymore...!
>>> I'll retry later to see if I see the error again...
>>
>> The node:10.19-alpine image has been updated since yesterday! That may
>> be why I don't have the issue anymore? It updated the latest
>> alpine:3.11 image, so 3.11.5.
>> The changelog talks says "main/musl: backport fixes from 1.2.0":
>> https://www.alpinelinux.org/posts/Alpine-3.11.5-released.html
> 
> I'm not aware of any changes that are at all related to your problem.
> 
> Rich

My first guess is something to do with the really egregious inspector
signal stack bug.  Our patch for Node 10.x is at:

https://code.foxkit.us/adelie/packages/blob/master/user/node/stack-silliness.patch

Alpine doesn't patch it for Node 12 because Node 12 has this code to
purportedly "handle musl":

https://github.com/nodejs/node/blob/master/src/inspector_agent.cc#L112

But I still feel like it may be wrong.

Best,
--arw

-- 
A. Wilcox (awilfox)
Project Lead, Adélie Linux
https://www.adelielinux.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [musl] [Bug] Do not ignore membarrier return code
  2020-03-24 16:08         ` A. Wilcox
@ 2020-03-24 17:42           ` Rich Felker
  0 siblings, 0 replies; 8+ messages in thread
From: Rich Felker @ 2020-03-24 17:42 UTC (permalink / raw)
  To: musl

On Tue, Mar 24, 2020 at 11:08:10AM -0500, A. Wilcox wrote:
> On 24/03/2020 09:01, Rich Felker wrote:
> > On Tue, Mar 24, 2020 at 02:53:21PM +0100, Julio Guerra wrote:
> >>> But I wanted to give you more details with stack traces of the
> >>> segfault by retrying today with gdb but I cannot reproduce it
> >>> anymore...!
> >>> I'll retry later to see if I see the error again...
> >>
> >> The node:10.19-alpine image has been updated since yesterday! That may
> >> be why I don't have the issue anymore? It updated the latest
> >> alpine:3.11 image, so 3.11.5.
> >> The changelog talks says "main/musl: backport fixes from 1.2.0":
> >> https://www.alpinelinux.org/posts/Alpine-3.11.5-released.html
> > 
> > I'm not aware of any changes that are at all related to your problem.
> > 
> > Rich
> 
> My first guess is something to do with the really egregious inspector
> signal stack bug.  Our patch for Node 10.x is at:
> 
> https://code.foxkit.us/adelie/packages/blob/master/user/node/stack-silliness.patch
> 
> Alpine doesn't patch it for Node 12 because Node 12 has this code to
> purportedly "handle musl":
> 
> https://github.com/nodejs/node/blob/master/src/inspector_agent.cc#L112
> 
> But I still feel like it may be wrong.

I think this may be the cause. I don't think Alpine ever had the patch
Adélie had to fix this bug in node, but 3.11.5 has Node 12.15.0 which
has the above fix. It looks like Alpine 3.11.0 has 12.14.0 which
should also have the fix though...

Rich

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2020-03-24 17:42 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-23 16:10 [musl] [Bug] Do not ignore membarrier return code Julio Guerra
2020-03-23 16:38 ` Rich Felker
2020-03-24 13:20   ` Julio Guerra
2020-03-24 13:53     ` Rich Felker
2020-03-24 13:53     ` Julio Guerra
2020-03-24 14:01       ` Rich Felker
2020-03-24 16:08         ` A. Wilcox
2020-03-24 17:42           ` Rich Felker

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).