From mboxrd@z Thu Jan 1 00:00:00 1970 X-Msuck: nntp://news.gmane.org/gmane.linux.lib.musl.general/10992 Path: news.gmane.org!.POSTED!not-for-mail From: Eric Hassold Newsgroups: gmane.linux.lib.musl.general Subject: Re: Fix pthread_create on some devices failing to initialize guard area Date: Mon, 30 Jan 2017 13:30:00 -0800 Message-ID: <81f188cf-3fb8-2899-5c24-ec72d38ad300@gmail.com> References: <20170120195649.GS1533@brightrain.aerifal.cx> <30588c41-eb3d-627e-c5eb-91e19ef56790@gmail.com> <20170120212933.GT1533@brightrain.aerifal.cx> <80ab9b97-dc0d-e642-cbf1-1b20e1cddf64@gmail.com> Reply-To: musl@lists.openwall.com NNTP-Posting-Host: blaine.gmane.org Mime-Version: 1.0 Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: 7bit X-Trace: blaine.gmane.org 1485811818 20783 195.159.176.226 (30 Jan 2017 21:30:18 GMT) X-Complaints-To: usenet@blaine.gmane.org NNTP-Posting-Date: Mon, 30 Jan 2017 21:30:18 +0000 (UTC) User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:45.0) Gecko/20100101 Thunderbird/45.7.0 To: musl@lists.openwall.com Original-X-From: musl-return-11007-gllmg-musl=m.gmane.org@lists.openwall.com Mon Jan 30 22:30:13 2017 Return-path: Envelope-to: gllmg-musl@m.gmane.org Original-Received: from mother.openwall.net ([195.42.179.200]) by blaine.gmane.org with smtp (Exim 4.84_2) (envelope-from ) id 1cYJWf-0005Af-0m for gllmg-musl@m.gmane.org; Mon, 30 Jan 2017 22:30:13 +0100 Original-Received: (qmail 31955 invoked by uid 550); 30 Jan 2017 21:30:16 -0000 Mailing-List: contact musl-help@lists.openwall.com; run by ezmlm Precedence: bulk List-Post: List-Help: List-Unsubscribe: List-Subscribe: List-ID: Original-Received: (qmail 31922 invoked from network); 30 Jan 2017 21:30:15 -0000 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=subject:to:references:from:message-id:date:user-agent:mime-version :in-reply-to:content-transfer-encoding; bh=G++7SMsAchMDYDCeM4EXDPGvyfqiG9i7+UBqPYvZkGY=; b=nEFBabC8Z5lCl1YvemjszXlZ6y98atMuXOeac6hS5t3ExdQSJRs5TkeGm8p2SV90mP 7VM6wBbo64S+YlLoi7pev+8BlRX7tR+x+NGkCyluVqSAQvY7XYDUYrhYxCCQbXpTjoUx ingxKhK9XcH26gbLvYKP7Vsq+4diQYrOgvXjH2K566rx2D2iXmY/NOWVNXytblCVB9s3 h86RrlOfOcvFy5WYGBfQ6xOLi2t7yBjoH0uuMSHQqrbU69l1drdcKz8ql7REni3DXBRN lpDogM1TwXUZpWHk14D2V95PMRfWlRkTror5Ca9fv22hVpu093Epo4W3kjGy8jA+rY2P dWJQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:subject:to:references:from:message-id:date :user-agent:mime-version:in-reply-to:content-transfer-encoding; bh=G++7SMsAchMDYDCeM4EXDPGvyfqiG9i7+UBqPYvZkGY=; b=tVTJ8jmGPC8FmCOZkBTWfVAmhABxfSTdUSpmsvsfgbYrxE+60pzeQrq/WObpYBDBxn 1tyDDBAEZbgeUpYZCc3+FD1jRpUvbmXlWDuWA8JxAnV8pMBe4WrJBx3BXiY9OahqkHTo LidEIuEz+DcVDpfDKVqjD/12p+479LHZt+WFCXwbPWGg6QglI9R3VsvP//SrxUcvJtv6 JsM9fqFq7NDZA1hI/Pxu9Nak6TgMaO59vusFNc1rVlO++/QI3UWW9NR3m8mLr0N1C+PZ B0N53jDsi1WabbxuVES6iIE8Hhl7PjzFCciMY6BKCEDton6p1UJequnBuIJSPlh4yETg SPqw== X-Gm-Message-State: AIkVDXIkkWkvNLIs8YBqIq84HuAMNIhSsgt8QcJ5N9SLeyJNtQHPQKTh9lfeFLowdKr6nQ== X-Received: by 10.99.174.4 with SMTP id q4mr26533964pgf.186.1485811802358; Mon, 30 Jan 2017 13:30:02 -0800 (PST) In-Reply-To: <80ab9b97-dc0d-e642-cbf1-1b20e1cddf64@gmail.com> Xref: news.gmane.org gmane.linux.lib.musl.general:10992 Archived-At: On 1/20/17 2:42 PM, Eric Hassold wrote: > > > On 1/20/17 1:29 PM, Rich Felker wrote: >> On Fri, Jan 20, 2017 at 01:04:28PM -0800, Eric Hassold wrote: >>> On 1/20/17 11:56 AM, Rich Felker wrote: >>>> On Fri, Jan 20, 2017 at 11:45:09AM -0800, Eric Hassold wrote: >>>>> Hi All, >>>>> >>>>> While deploying test static executable across farm of different >>>>> embedded systems, found out that pthread_create() is failing >>>>> systematically on some (very few) arm-linux devices whenever non >>>>> null stack guard is enabled (that is, also when calling >>>>> pthread_create with default - i.e. null - attributes since default >>>>> is a one page of guard). One of those device is for example a >>>>> Marvell Armada 375 running Linux 3.10.39. Same test code, built with >>>>> alternative libc implementations (glibc, uClibc) works as expected >>>>> on those devices. >>>>> >>>>> >>>>> Issue >>>>> >>>>> This occurs because of call to mprotect() in pthread_create fails. >>>>> In current implementation, if guard size is non null, memory for >>>>> (guard + stack + ...) is first allocated (mmap'ed) with no >>>>> accessibility (PROT_NONE), then mprotect() is called to re-enable >>>>> read/write access to (memory + guardsize). Since call to mprotect() >>>>> systematically fails in this scenario (returning error code EINVAL), >>>>> it is impossible to create thread. >>>> Failure is ignored and the memory is assumed to be writable in this >>>> case, since EINVAL is assumed to imply no MMU. Is this assumption >>>> wrong in your case, and if so, can you explain why? >>> In my case, devices exhibiting issue are not MMU-less, they are >>> Cortex-A9 devices with valid mmu / page protection working as >>> expected otherwise. Note that current Musl code assumes ENOSYS means >>> no MMU and handles it by assuming the system has no page protection >>> at all. For the case I observe, it is EINVAL which is returned, this >>> is not ignored, so memory is unmap'ed and pthread_create() fails. >> In that case I think this is a kernel bug. Do you know why EINVAL is >> happening? If there's an MMU, Linux should be able to replace the >> anon PROT_NONE pages with anon RW pages. > > Agree. Unfortunately, those are devices we don't built the kernel for, > so have been hardly able to track issue deeper. The point is however > that such devices with this issue in kernel might not be that > uncommon, and it concretely means impossibility at that moment to > deploy on them a functional static executable built with musl. >> >>>>> In proposed patch (attached below), memory for (guard + stack + ...) >>>>> is first mmap'ed with read/write accessibility, then guard area is >>>>> protected by calling mprotect() with PROT_NONE on guardsize first >>>>> bytes of returned memory. This call to mprotect() to remove all >>>>> accessibility on guard area, with guard area being at beginning of >>>>> previously mmap'ed memory, works correctly on those platforms having >>>>> issue with current implementation. Incidentally, this makes the >>>>> logic more concise to handle both cases (with or without guard) is a >>>>> more consistent way, and handle systems with partial/invalid page >>>>> protection implementation (e.g. mprotect() returning ENOSYS) more >>>>> gracefully since the stack is explicitly created with read/write >>>>> access. >>>> This doesn't work correctly on normal systems with mmu, because the >>>> size of the guard pages is accounted against commit charge. Linux >>>> should, but AFAIK doesn't, subtract it from commit charge once it's >>>> changed to PROT_NONE without having been dirtied, but even if this bug >>>> is fixed on the kernel side, there would still be a moment where >>>> excess commit charge is consumed and thus where pthread_create might >>>> spuriously fail or cause allocations in other processes/threads to >>>> fail. >>>> >>>> If the kernel is not allocating actually-usable address ranges for >>>> PROT_NONE on all nommu systems, I think the only solution is to handle >>>> EINVAL from mprotect by going back and re-doing the mmap with >>>> PROT_READ|PROT_WRITE. Do you have any better ideas? >>>> >>>> Rich >>> Had this "deja vu" feeling... reminds me conversation you had in >>> this thread some time ago elsewhere... >>> https://sourceware.org/ml/libc-alpha/2015-09/msg00447.html >>> >>> Your proposition seems reasonable on nommu system, but again, the >>> issue observed here is on legit systems with mmu, with mprotect >>> failing with EINVAL (and not ENOSYS), for some other reason than >>> system not supporting page protection. Catching EINVAL error >>> returned by mprotect and falling back to re-doing the mmap would >>> mean actually silently running without stack guard on system >>> supporting it, so I believe it is actually legitimate to fail and >>> return error in that case. But that's difference use case than the >>> issue I'm observing. >>> >>> I took note of Balazs's suggestion (in the thread referenced above) >>> to switch to a pattern similar to Musl's current one >>> (mmap(PROT_NONE) + mprotect(stack, READ|WRITE)) in order to avoid >>> those guard pages to actually occupy resources. But I can indeed >>> observe that this approach fails on some devices (which have valid >>> mmu), while I'm not sure I'm seeing the issue with first mapping >>> PROT_READ|PROT_WRITE then mprotect(PROT_NONE) guard area. Latter >>> approach (as implemented by patch) is, at least, consistent with all >>> the other implementations out there (I checked glibc's >>> allocatestack.c, but also e.g. bionic), and couldn't find report of >>> those failures you are envisaging. >> Consider the case of guard_size=128M stack_size=128k with >> Commit_Limit=128M. This will fail with your approach but works >> perfectly well now. > Right, makes sense. Though I would point out that such uncommon > scenario for an application would fail when linked with anything but > musl, since "my approach" is the widespread one across all other libc > implementations (glibc, bionic, ...). But I understand both approaches > aim at working around issue in some corner case while creating issue > in other rare one, and so at the end it's all about deciding which > edge case is more critical than the other, and whether > consistency/compatibility with existing 3rd parties solutions matters > or not. > > But in order to get closer to "having the cake and eat it too", please > find attached another patch implementing the "unmaping and re-doing" > strategy you initially suggested, i.e. starting with current approach, > then giving it a second chance using "my" approach if and only if > issue with mprotect() is detected. Should be consistent with current > behavior on any system working correctly today, and just provide a > plan B falling back to more common approach only when needed. Does > that sound more acceptable? > > Thanks, > > Eric > > ---- > [PATCH] set up guard protection after stack is mapped > > calling mprotect beyond the guard page of PROT_NONE mmap-ed stack > fails on > some devices with buggy kernel, making it impossible to create > thread. if such condition is met, fall back to allocating memory > for stack and guard first, with read and write permission, then > protect guard > area. > --- > src/thread/pthread_create.c | 12 +++++++++++- > 1 file changed, 11 insertions(+), 1 deletion(-) > > diff --git a/src/thread/pthread_create.c b/src/thread/pthread_create.c > index 49f2f72..d3c030b 100644 > --- a/src/thread/pthread_create.c > +++ b/src/thread/pthread_create.c > @@ -242,7 +242,17 @@ int __pthread_create(pthread_t *restrict res, > const pthread_attr_t *restrict att > if (__mprotect(map+guard, size-guard, PROT_READ|PROT_WRITE) > && errno != ENOSYS) { > __munmap(map, size); > - goto fail; > + if (errno == EINVAL) { > + map = __mmap(0, size, PROT_READ|PROT_WRITE, > MAP_PRIVATE|MAP_ANON, -1, 0); > + if (map == MAP_FAILED) goto fail; > + if (__mprotect(map, guard, PROT_NONE) > + && errno != ENOSYS) { > + __munmap(map, size); > + goto fail; > + } > + } else { > + goto fail; > + } > } > } else { > map = __mmap(0, size, PROT_READ|PROT_WRITE, > MAP_PRIVATE|MAP_ANON, -1, 0); Hi Rich, Pinging... any comment, feedback or concern about latest version of the patch, attached above, keeping current behavior but falling back to (mmap(PROT_READ|PROT_WRITE) && mprotect(guard, none)) if and only if current approach detected to fail) ? Eric