From mboxrd@z Thu Jan  1 00:00:00 1970
X-Msuck: nntp://news.gmane.org/gmane.linux.lib.musl.general/10992
Path: news.gmane.org!.POSTED!not-for-mail
From: Eric Hassold <hassold@gmail.com>
Newsgroups: gmane.linux.lib.musl.general
Subject: Re: Fix pthread_create on some devices failing to initialize
 guard area
Date: Mon, 30 Jan 2017 13:30:00 -0800
Message-ID: <81f188cf-3fb8-2899-5c24-ec72d38ad300@gmail.com>
References: <dc9ff861-1893-10fe-07d9-333310fc56bc@gmail.com>
 <20170120195649.GS1533@brightrain.aerifal.cx>
 <30588c41-eb3d-627e-c5eb-91e19ef56790@gmail.com>
 <20170120212933.GT1533@brightrain.aerifal.cx>
 <80ab9b97-dc0d-e642-cbf1-1b20e1cddf64@gmail.com>
Reply-To: musl@lists.openwall.com
NNTP-Posting-Host: blaine.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 7bit
X-Trace: blaine.gmane.org 1485811818 20783 195.159.176.226 (30 Jan 2017 21:30:18 GMT)
X-Complaints-To: usenet@blaine.gmane.org
NNTP-Posting-Date: Mon, 30 Jan 2017 21:30:18 +0000 (UTC)
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:45.0)
 Gecko/20100101 Thunderbird/45.7.0
To: musl@lists.openwall.com
Original-X-From: musl-return-11007-gllmg-musl=m.gmane.org@lists.openwall.com Mon Jan 30 22:30:13 2017
Return-path: <musl-return-11007-gllmg-musl=m.gmane.org@lists.openwall.com>
Envelope-to: gllmg-musl@m.gmane.org
Original-Received: from mother.openwall.net ([195.42.179.200])
	by blaine.gmane.org with smtp (Exim 4.84_2)
	(envelope-from <musl-return-11007-gllmg-musl=m.gmane.org@lists.openwall.com>)
	id 1cYJWf-0005Af-0m
	for gllmg-musl@m.gmane.org; Mon, 30 Jan 2017 22:30:13 +0100
Original-Received: (qmail 31955 invoked by uid 550); 30 Jan 2017 21:30:16 -0000
Mailing-List: contact musl-help@lists.openwall.com; run by ezmlm
Precedence: bulk
List-Post: <mailto:musl@lists.openwall.com>
List-Help: <mailto:musl-help@lists.openwall.com>
List-Unsubscribe: <mailto:musl-unsubscribe@lists.openwall.com>
List-Subscribe: <mailto:musl-subscribe@lists.openwall.com>
List-ID: <musl.lists.openwall.com>
Original-Received: (qmail 31922 invoked from network); 30 Jan 2017 21:30:15 -0000
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20161025;
        h=subject:to:references:from:message-id:date:user-agent:mime-version
         :in-reply-to:content-transfer-encoding;
        bh=G++7SMsAchMDYDCeM4EXDPGvyfqiG9i7+UBqPYvZkGY=;
        b=nEFBabC8Z5lCl1YvemjszXlZ6y98atMuXOeac6hS5t3ExdQSJRs5TkeGm8p2SV90mP
         7VM6wBbo64S+YlLoi7pev+8BlRX7tR+x+NGkCyluVqSAQvY7XYDUYrhYxCCQbXpTjoUx
         ingxKhK9XcH26gbLvYKP7Vsq+4diQYrOgvXjH2K566rx2D2iXmY/NOWVNXytblCVB9s3
         h86RrlOfOcvFy5WYGBfQ6xOLi2t7yBjoH0uuMSHQqrbU69l1drdcKz8ql7REni3DXBRN
         lpDogM1TwXUZpWHk14D2V95PMRfWlRkTror5Ca9fv22hVpu093Epo4W3kjGy8jA+rY2P
         dWJQ==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:subject:to:references:from:message-id:date
         :user-agent:mime-version:in-reply-to:content-transfer-encoding;
        bh=G++7SMsAchMDYDCeM4EXDPGvyfqiG9i7+UBqPYvZkGY=;
        b=tVTJ8jmGPC8FmCOZkBTWfVAmhABxfSTdUSpmsvsfgbYrxE+60pzeQrq/WObpYBDBxn
         1tyDDBAEZbgeUpYZCc3+FD1jRpUvbmXlWDuWA8JxAnV8pMBe4WrJBx3BXiY9OahqkHTo
         LidEIuEz+DcVDpfDKVqjD/12p+479LHZt+WFCXwbPWGg6QglI9R3VsvP//SrxUcvJtv6
         JsM9fqFq7NDZA1hI/Pxu9Nak6TgMaO59vusFNc1rVlO++/QI3UWW9NR3m8mLr0N1C+PZ
         B0N53jDsi1WabbxuVES6iIE8Hhl7PjzFCciMY6BKCEDton6p1UJequnBuIJSPlh4yETg
         SPqw==
X-Gm-Message-State: AIkVDXIkkWkvNLIs8YBqIq84HuAMNIhSsgt8QcJ5N9SLeyJNtQHPQKTh9lfeFLowdKr6nQ==
X-Received: by 10.99.174.4 with SMTP id q4mr26533964pgf.186.1485811802358;
        Mon, 30 Jan 2017 13:30:02 -0800 (PST)
In-Reply-To: <80ab9b97-dc0d-e642-cbf1-1b20e1cddf64@gmail.com>
Xref: news.gmane.org gmane.linux.lib.musl.general:10992
Archived-At: <http://permalink.gmane.org/gmane.linux.lib.musl.general/10992>


On 1/20/17 2:42 PM, Eric Hassold wrote:
>
>
> On 1/20/17 1:29 PM, Rich Felker wrote:
>> On Fri, Jan 20, 2017 at 01:04:28PM -0800, Eric Hassold wrote:
>>> On 1/20/17 11:56 AM, Rich Felker wrote:
>>>> On Fri, Jan 20, 2017 at 11:45:09AM -0800, Eric Hassold wrote:
>>>>> Hi All,
>>>>>
>>>>> While deploying test static executable across farm of different
>>>>> embedded systems, found out that pthread_create() is failing
>>>>> systematically on some (very few) arm-linux devices whenever non
>>>>> null stack guard is enabled (that is, also when calling
>>>>> pthread_create with default - i.e. null - attributes since default
>>>>> is a one page of guard). One of those device is for example a
>>>>> Marvell Armada 375 running Linux 3.10.39. Same test code, built with
>>>>> alternative libc implementations (glibc, uClibc) works as expected
>>>>> on those devices.
>>>>>
>>>>>
>>>>> Issue
>>>>>
>>>>> This occurs because of call to mprotect() in pthread_create fails.
>>>>> In current implementation, if guard size is non null, memory for
>>>>> (guard + stack + ...) is first allocated (mmap'ed) with no
>>>>> accessibility (PROT_NONE), then mprotect() is called to re-enable
>>>>> read/write access to (memory + guardsize). Since call to mprotect()
>>>>> systematically fails in this scenario (returning error code EINVAL),
>>>>> it is impossible to create thread.
>>>> Failure is ignored and the memory is assumed to be writable in this
>>>> case, since EINVAL is assumed to imply no MMU. Is this assumption
>>>> wrong in your case, and if so, can you explain why?
>>> In my case, devices exhibiting issue are not MMU-less, they are
>>> Cortex-A9 devices with valid mmu / page protection working as
>>> expected otherwise. Note that current Musl code assumes ENOSYS means
>>> no MMU and handles it by assuming the system has no page protection
>>> at all. For the case I observe, it is EINVAL which is returned, this
>>> is not ignored, so memory is unmap'ed and pthread_create() fails.
>> In that case I think this is a kernel bug. Do you know why EINVAL is
>> happening? If there's an MMU, Linux should be able to replace the
>> anon PROT_NONE pages with anon RW pages.
>
> Agree. Unfortunately, those are devices we don't built the kernel for, 
> so have been hardly able to track issue deeper. The point is however 
> that such devices with this issue in kernel might not be that 
> uncommon, and it concretely means impossibility at that moment to 
> deploy on them a functional static executable built with musl.
>>
>>>>> In proposed patch (attached below), memory for (guard + stack + ...)
>>>>> is first mmap'ed with read/write accessibility, then guard area is
>>>>> protected by calling mprotect() with PROT_NONE on guardsize first
>>>>> bytes of returned memory. This call to mprotect() to remove all
>>>>> accessibility on guard area, with guard area being at beginning of
>>>>> previously mmap'ed memory, works correctly on those platforms having
>>>>> issue with current implementation. Incidentally, this makes the
>>>>> logic more concise to handle both cases (with or without guard) is a
>>>>> more consistent way, and handle systems with partial/invalid page
>>>>> protection implementation (e.g. mprotect() returning ENOSYS) more
>>>>> gracefully since the stack is explicitly created with read/write
>>>>> access.
>>>> This doesn't work correctly on normal systems with mmu, because the
>>>> size of the guard pages is accounted against commit charge. Linux
>>>> should, but AFAIK doesn't, subtract it from commit charge once it's
>>>> changed to PROT_NONE without having been dirtied, but even if this bug
>>>> is fixed on the kernel side, there would still be a moment where
>>>> excess commit charge is consumed and thus where pthread_create might
>>>> spuriously fail or cause allocations in other processes/threads to
>>>> fail.
>>>>
>>>> If the kernel is not allocating actually-usable address ranges for
>>>> PROT_NONE on all nommu systems, I think the only solution is to handle
>>>> EINVAL from mprotect by going back and re-doing the mmap with
>>>> PROT_READ|PROT_WRITE. Do you have any better ideas?
>>>>
>>>> Rich
>>> Had this "deja vu" feeling... reminds me conversation you had in
>>> this thread some time ago elsewhere...
>>> https://sourceware.org/ml/libc-alpha/2015-09/msg00447.html
>>>
>>> Your proposition seems reasonable on nommu system, but again, the
>>> issue observed here is on legit systems with mmu, with mprotect
>>> failing with EINVAL (and not ENOSYS), for some other reason than
>>> system not supporting page protection. Catching EINVAL error
>>> returned by mprotect and falling back to re-doing the mmap would
>>> mean actually silently running without stack guard on system
>>> supporting it, so I believe it is actually legitimate to fail and
>>> return error in that case. But that's difference use case than the
>>> issue I'm observing.
>>>
>>> I took note of Balazs's suggestion (in the thread referenced above)
>>> to switch to a pattern similar to Musl's current one
>>> (mmap(PROT_NONE) + mprotect(stack, READ|WRITE)) in order to avoid
>>> those guard pages to actually occupy resources. But I can indeed
>>> observe that this approach fails on some devices (which have valid
>>> mmu), while I'm not sure I'm seeing the issue with first mapping
>>> PROT_READ|PROT_WRITE then mprotect(PROT_NONE) guard area. Latter
>>> approach (as implemented by patch) is, at least, consistent with all
>>> the other implementations out there (I checked glibc's
>>> allocatestack.c, but also e.g. bionic), and couldn't find report of
>>> those failures you are envisaging.
>> Consider the case of guard_size=128M stack_size=128k with
>> Commit_Limit=128M. This will fail with your approach but works
>> perfectly well now.
> Right, makes sense. Though I would point out that such uncommon 
> scenario for an application would fail when linked with anything but 
> musl, since "my approach" is the widespread one across all other libc 
> implementations (glibc, bionic, ...). But I understand both approaches 
> aim at working around issue in some corner case while creating issue 
> in other rare one, and so at the end it's all about deciding which 
> edge case is more critical than the other, and whether 
> consistency/compatibility with existing 3rd parties solutions matters 
> or not.
>
> But in order to get closer to "having the cake and eat it too", please 
> find attached another patch implementing the "unmaping and re-doing" 
> strategy you initially suggested, i.e. starting with current approach, 
> then giving it a second chance using "my" approach if and only if 
> issue with mprotect() is detected. Should be consistent with current 
> behavior on any system working correctly today, and just provide a 
> plan B falling back to more common approach only when needed. Does 
> that sound more acceptable?
>
> Thanks,
>
> Eric
>
> ----
> [PATCH] set up guard protection after stack is mapped
>
> calling mprotect beyond the guard page of PROT_NONE mmap-ed stack 
> fails on
> some devices with buggy kernel, making it impossible to create
> thread. if such condition is met, fall back to allocating memory
> for stack and guard first, with read and write permission, then 
> protect guard
> area.
> ---
>  src/thread/pthread_create.c | 12 +++++++++++-
>  1 file changed, 11 insertions(+), 1 deletion(-)
>
> diff --git a/src/thread/pthread_create.c b/src/thread/pthread_create.c
> index 49f2f72..d3c030b 100644
> --- a/src/thread/pthread_create.c
> +++ b/src/thread/pthread_create.c
> @@ -242,7 +242,17 @@ int __pthread_create(pthread_t *restrict res, 
> const pthread_attr_t *restrict att
>              if (__mprotect(map+guard, size-guard, PROT_READ|PROT_WRITE)
>                  && errno != ENOSYS) {
>                  __munmap(map, size);
> -                goto fail;
> +                if (errno == EINVAL) {
> +                    map = __mmap(0, size, PROT_READ|PROT_WRITE, 
> MAP_PRIVATE|MAP_ANON, -1, 0);
> +                    if (map == MAP_FAILED) goto fail;
> +                    if (__mprotect(map, guard, PROT_NONE)
> +                        && errno != ENOSYS) {
> +                        __munmap(map, size);
> +                        goto fail;
> +                                        }
> +                } else {
> +                    goto fail;
> +                                }
>              }
>          } else {
>              map = __mmap(0, size, PROT_READ|PROT_WRITE, 
> MAP_PRIVATE|MAP_ANON, -1, 0);

Hi Rich,

Pinging... any comment, feedback or concern about latest version of the 
patch, attached above, keeping current behavior but falling back to 
(mmap(PROT_READ|PROT_WRITE) && mprotect(guard, none)) if and only if 
current approach detected to fail) ?

Eric