mailing list of musl libc
* Proposed approach for malloc to deal with failing brk
@ 2014-03-31  0:41 Rich Felker
From: Rich Felker @ 2014-03-31  0:41 UTC (permalink / raw)
  To: musl

Failure of malloc when a badly-placed VMA blocks the brk from being
expanded has been a known issue for a while, but I wasn't aware of how
badly it was breaking PIE binaries on affected systems. Now that it's
been raised again, I'm looking to fix it, and I have a proposed
solution. First, some background:

We want brk. This is not because "brk is faster than mmap", but
because it takes a lot of work to replicate what brk does using mmap,
and there's no hope of making a complex dance of multiple syscalls
equally efficient. My best idea for emulating brk was to mmap a huge
PROT_NONE region and gradually mprotect it to PROT_READ|PROT_WRITE,
but it turns out this is what glibc does for per-thread arenas and
it's really slow, probably because it involves splitting one VMA and
merging into another.

So the solution is not to replicate brk. The reason we want brk
instead of mmap is to avoid pathological fragmentation: if we obtain a
new block of memory from mmap to add to the heap, there's no
efficient way to track whether it's adjacent to another free region
it could be merged with. But there's another solution to this
fragmentation problem: an asymptotic one. Here's how it works:

Once brk has failed, begin obtaining new blocks to add to the heap via
mmap, with the size carefully chosen:

    MAX(requested_size, PAGE_SIZE<<(mmap_cnt/2))

where mmap_cnt is initially 0 and increments by 1 each time a new heap
block has to be obtained via mmap. This ensures exponential growth of
the blocks added, so that the fragmentation cost will be extremely
finite (asymptotically zero relative fragmentation) while bounding the
preallocation to roughly 50% beyond the actual amount of memory needed
so far.
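
(For illustration only, not part of any patch: a tiny standalone
program, with a made-up helper name, showing how the sizing rule plays
out assuming 4k pages.)

#include <stddef.h>
#include <stdio.h>

#define PAGE_SIZE 4096UL

/* Size of the next mmap'd heap block for a request of `req` bytes:
 * the minimum doubles every second call, so total preallocation
 * stays within roughly 50% beyond what has actually been needed. */
static size_t next_block_size(size_t req, unsigned *mmap_cnt)
{
	size_t min = PAGE_SIZE << (*mmap_cnt / 2);
	(*mmap_cnt)++;
	return req > min ? req : min;
}

int main(void)
{
	unsigned cnt = 0;
	for (int i = 0; i < 8; i++)
		printf("block %d: %zu bytes\n", i, next_block_size(1, &cnt));
	/* prints 4096, 4096, 8192, 8192, 16384, 16384, 32768, 32768 */
	return 0;
}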

Perhaps the best part is that this solution can be implemented in just
a few lines of code.

Rich



* Re: Proposed approach for malloc to deal with failing brk
From: Rich Felker @ 2014-03-31  4:32 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 168 bytes --]

> Perhaps the best part is that this solution can be implemented in just
> a few lines of code.

And here's the patch. Please test and let me know if this works.

Rich

[-- Attachment #2: brk_fallback.diff --]
[-- Type: text/plain, Size: 1098 bytes --]

diff --git a/src/malloc/malloc.c b/src/malloc/malloc.c
index d6ad904..3c1ddcd 100644
--- a/src/malloc/malloc.c
+++ b/src/malloc/malloc.c
@@ -37,6 +37,7 @@ static struct {
 	struct bin bins[64];
 	int brk_lock[2];
 	int free_lock[2];
+	unsigned mmap_areas;
 } mal;
 
 
@@ -162,7 +163,31 @@ static struct chunk *expand_heap(size_t n)
 	new = mal.brk + n + SIZE_ALIGN + PAGE_SIZE - 1 & -PAGE_SIZE;
 	n = new - mal.brk;
 
-	if (__brk(new) != new) goto fail;
+	if (__brk(new) != new) {
+		size_t min = (size_t)PAGE_SIZE << ++mal.mmap_areas/2;
+		if (!min) mal.mmap_areas--;
+		n += -n & PAGE_SIZE-1;
+		if (n < min) n = min;
+		void *area = __mmap(0, n, PROT_READ|PROT_WRITE,
+			MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
+		if (area == MAP_FAILED) {
+			mal.mmap_areas = 0;
+			goto fail;
+		}
+
+		area = (char *)area + SIZE_ALIGN - OVERHEAD;
+		w = area;
+		n -= SIZE_ALIGN;
+		w->psize = 0 | C_INUSE;
+		w->csize = n | C_INUSE;
+		w = NEXT_CHUNK(w);
+		w->psize = n | C_INUSE;
+		w->csize = 0 | C_INUSE;
+
+		unlock(mal.brk_lock);
+
+		return area;
+	}
 
 	w = MEM_TO_CHUNK(new);
 	w->psize = n | C_INUSE;


* Re: Proposed approach for malloc to deal with failing brk
From: u-igbb @ 2014-03-31  7:44 UTC (permalink / raw)
  To: musl

Hello Rich,

On Mon, Mar 31, 2014 at 12:32:48AM -0400, Rich Felker wrote:
> > Perhaps the best part is that this solution can be implemented in just
> > a few lines of code.
> 
> And here's the patch. Please test and let me know if this works.

Extremely appreciated (and a nice approach indeed, as far as I can see).

I'm now rebuilding a bunch of programs (including the gcc compiler
itself) with a gcc that uses the patched musl, and so far it seems to
work. I guess this exercises malloc quite a bit.

(During the gcc stage rebuilds the loader is used implicitly and
presumably the heap exhaustion is not triggered, which confirms that
the patch did not damage the brk mode of operation. The compiler
otherwise seems capable of building everything I throw at it while
being run via the standalone loader, which previously failed due to
"no memory".)

You saved my day, thanks Rich.

Rune




* Re: Proposed approach for malloc to deal with failing brk
From: Szabolcs Nagy @ 2014-03-31 11:05 UTC (permalink / raw)
  To: musl

* Rich Felker <dalias@aerifal.cx> [2014-03-30 20:41:04 -0400]:
> We want brk. This is not because "brk is faster than mmap", but
> because it takes a lot of work to replicate what brk does using mmap,
> and there's no hope of making a complex dance of multiple syscalls
> equally efficient. My best idea for emulating brk was to mmap a huge

another reason to have brk: on some archs there is a TASK_UNMAPPED_BASE
limit in the kernel (1G normally) and mmap can only allocate above that

a large part of the first 1G is used for brk only (and top 1G is kernel)

so an mmap only allocator would limit the malloc space to 2G
(at least 32bit arm and mips i think)
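
(illustration, not from the original mail: a quick probe that prints
where the break and a fresh anonymous mapping land; on the kernels
described above the mmap result would sit at or above
TASK_UNMAPPED_BASE while the break stays in the low region)

#define _DEFAULT_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	void *brk_end = sbrk(0);                 /* current program break */
	void *map = mmap(0, 4096, PROT_READ|PROT_WRITE,
	                 MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
	printf("current brk:  %p\n", brk_end);
	printf("mmap returns: %p\n", map == MAP_FAILED ? (void *)0 : map);
	return 0;
}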

> Once brk has failed, begin obtaining new blocks to add to the heap via
> mmap, with the size carefully chosen:
> 
>     MAX(requested_size, PAGE_SIZE<<(mmap_cnt/2))

yes this works, i added a regression test for brk failure
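
(not the actual test, but a minimal sketch of the idea: block brk
expansion by mapping a page right above the current break, then check
that malloc still succeeds via the mmap fallback)

#define _DEFAULT_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	uintptr_t pg = (uintptr_t)sysconf(_SC_PAGESIZE);
	uintptr_t cur = (uintptr_t)sbrk(0);

	/* map one PROT_NONE page at the next page boundary above the
	 * break; MAP_FIXED is tolerable only because nothing else is
	 * expected to live there in this toy program */
	void *block = mmap((void *)((cur + pg - 1) & -pg), pg, PROT_NONE,
	                   MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED, -1, 0);
	if (block == MAP_FAILED) { perror("mmap"); return 1; }

	for (int i = 0; i < 1000; i++) {
		void *p = malloc(10000);
		if (!p) { puts("FAIL: malloc returned 0"); return 1; }
		memset(p, 0, 10000);
	}
	puts("OK");
	return 0;
}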



* Re: Proposed approach for malloc to deal with failing brk
From: Vasily Kulikov @ 2014-04-01 16:40 UTC (permalink / raw)
  To: musl

On Sun, Mar 30, 2014 at 20:41 -0400, Rich Felker wrote:
> We want brk. This is not because "brk is faster than mmap", but
> because it takes a lot of work to replicate what brk does using mmap,
> and there's no hope of making a complex dance of multiple syscalls
> equally efficient. My best idea for emulating brk was to mmap a huge
> PROT_NONE region and gradually mprotect it to PROT_READ|PROT_WRITE,

What problem are you trying to solve via PROT_NONE -> PROT_WRITE?  Why not
simply mmap it as PROT_WRITE right away?  Linux will not allocate physical
pages until the first access, so you don't lose physical memory when it is
not actually used.

> but it turns out this is what glibc does for per-thread arenas and
> it's really slow, probably because it involves splitting one VMA and
> merging into another.

Yes, both VMA split/merge and PTE/etc. changes.

-- 
Vasily



* Re: Proposed approach for malloc to deal with failing brk
From: Szabolcs Nagy @ 2014-04-01 17:01 UTC (permalink / raw)
  To: musl

* Vasily Kulikov <segoon@openwall.com> [2014-04-01 20:40:57 +0400]:
> On Sun, Mar 30, 2014 at 20:41 -0400, Rich Felker wrote:
> > We want brk. This is not because "brk is faster than mmap", but
> > because it takes a lot of work to replicate what brk does using mmap,
> > and there's no hope of making a complex dance of multiple syscalls
> > equally efficient. My best idea for emulating brk was to mmap a huge
> > PROT_NONE region and gradually mprotect it to PROT_READ|PROT_WRITE,
> 
> What problem are you trying to solve via PROT_NONE -> PROT_WRITE?  Why not

writable pages count as commit charge and that matters with a huge mmap
on systems with no overcommit

> simply mmap it as PROT_WRITE right away?  Linux will not allocate physical
> pages until the first access, so you don't lose physical memory when it is
> not actually used.



* Re: Proposed approach for malloc to deal with failing brk
From: Rich Felker @ 2014-04-01 17:03 UTC (permalink / raw)
  To: musl

On Tue, Apr 01, 2014 at 08:40:57PM +0400, Vasily Kulikov wrote:
> On Sun, Mar 30, 2014 at 20:41 -0400, Rich Felker wrote:
> > We want brk. This is not because "brk is faster than mmap", but
> > because it takes a lot of work to replicate what brk does using mmap,
> > and there's no hope of making a complex dance of multiple syscalls
> > equally efficient. My best idea for emulating brk was to mmap a huge
> > PROT_NONE region and gradually mprotect it to PROT_READ|PROT_WRITE,
> 
> What problem are you trying to solve via PROT_NONE -> PROT_WRITE?  Why not
> simply mmap it as PROT_WRITE right away?  Linux will not allocate physical
> pages until the first access, so you don't lose physical memory when it is
> not actually used.

Commit accounting. Committing 100 megs to a process that's only asked
for (and only going to use) 100k is harmful because, in effect, it
forces people to turn on (or leave on) overcommit.
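
To make the commit-charge point concrete, here's a bare sketch
(illustration only, not what musl does) of that reserve-then-commit
emulation; only the mprotect'd prefix counts toward commit charge,
while the PROT_NONE tail is just reserved address space:

#include <stddef.h>
#include <sys/mman.h>

#define RESERVE (256UL*1024*1024)   /* arbitrary reservation for the sketch */

static char *base;
static size_t committed;

static int fake_brk_init(void)
{
	base = mmap(0, RESERVE, PROT_NONE,
	            MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
	return base == MAP_FAILED ? -1 : 0;
}

/* grow the emulated heap by n bytes (page-rounded); returns the old
 * end of the committed region, or 0 on failure */
static void *fake_sbrk(size_t n)
{
	size_t pg = 4096;               /* assume 4k pages for the sketch */
	n = (n + pg - 1) & -pg;
	if (n > RESERVE - committed) return 0;
	if (mprotect(base + committed, n, PROT_READ|PROT_WRITE)) return 0;
	void *old = base + committed;
	committed += n;
	return old;
}

int main(void)
{
	if (fake_brk_init()) return 1;
	return fake_sbrk(100*1024) ? 0 : 1;   /* commits ~100k of the 256M */
}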

> > but it turns out this is what glibc does for per-thread arenas and
> > it's really slow, probably because it involves splitting one VMA and
> > merging into another.
> 
> Yes, both VMA split/merge and PTE/etc. changes.

Well, the page table changes happen even if you just use madvise to
zero/'free' the memory and then take a page fault on write to get it
back, and that path is very fast compared to the mprotect approach, at
least as far as I can tell. So I think the VMA split/merge is the big
issue.
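
For comparison, a quick sketch (illustration only) of the
madvise-and-refault pattern I mean:

#define _DEFAULT_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 1 << 20;
	char *p = mmap(0, len, PROT_READ|PROT_WRITE,
	               MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED) return 1;

	memset(p, 1, len);              /* touch: pages become resident */
	madvise(p, len, MADV_DONTNEED); /* "free" them without unmapping */
	p[0] = 2;                       /* write faults in a fresh zero page */
	return 0;
}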

Rich


