From: Rich Felker
Newsgroups: gmane.linux.lib.musl.general
Subject: Re: [PATCH] optimize malloc0
Date: Tue, 4 Jul 2017 17:45:54 -0400
Message-ID: <20170704214554.GS1627@brightrain.aerifal.cx>
In-Reply-To: <20170626214339.10942-1-amonakov@ispras.ru>
To: musl@lists.openwall.com

On Tue, Jun 27, 2017 at 12:43:39AM +0300, Alexander Monakov wrote:
> Implementation of __malloc0 in malloc.c takes care to preserve zero
> pages by overwriting only non-zero data. However, malloc must have
> already modified auxiliary heap data just before and beyond the
> allocated region, so we know that edge pages need not be preserved.
>
> For allocations smaller than one page, pass them immediately to memset.
> Otherwise, use memset to handle partial pages at the head and tail of
> the allocation, and scan complete pages in the interior. Optimize the
> scanning loop by processing 16 bytes per iteration and handling rest of
> page via memset as soon as a non-zero byte is found.
> ---
> A followup to a recent IRC discussion. Code size cost on x86 is only about 80
> bytes (note e.g. how mal0_clear uses memset for two purposes simultaneously,
> handling the partial page at the end, and clearing interior non-zero pages).
>
> On a Sandy Bridge CPU, speed improvement for the potentially-zero-page scanning
> loop is almost 2x on 64-bit, almost 3x on 32-bit.
>
> Note that the existing implementation can over-clear by as much as
> sizeof(size_t)-1 bytes beyond the allocation; the new implementation never
> does that. This may expose application bugs that were hidden before.
>
> Alexander
>
>  src/malloc/malloc.c | 25 +++++++++++++++++++------
>  1 file changed, 19 insertions(+), 6 deletions(-)
>
> diff --git a/src/malloc/malloc.c b/src/malloc/malloc.c
> index d5ee4280..720fa696 100644
> --- a/src/malloc/malloc.c
> +++ b/src/malloc/malloc.c
> @@ -366,15 +366,28 @@ void *malloc(size_t n)
>  	return CHUNK_TO_MEM(c);
>  }
>  
> +static size_t mal0_clear(char *p, size_t pagesz, size_t n)
> +{
> +	typedef unsigned long long T;
> +	char *pp = p + n;
> +	size_t i = (uintptr_t)pp & (pagesz - 1);
> +	for (;;) {
> +		pp = memset(pp - i, 0, i);
> +		if (pp - p < pagesz) return pp - p;
> +		for (i = pagesz; i; i -= 2*sizeof(T), pp -= 2*sizeof(T))
> +			if (((T *)pp)[-1] | ((T *)pp)[-2])
> +				break;
> +	}
> +}
> +
>  void *__malloc0(size_t n)
>  {
>  	void *p = malloc(n);
> -	if (p && !IS_MMAPPED(MEM_TO_CHUNK(p))) {
> -		size_t *z;
> -		n = (n + sizeof *z - 1)/sizeof *z;
> -		for (z=p; n; n--, z++) if (*z) *z=0;
> -	}
> -	return p;
> +	if (!p || IS_MMAPPED(MEM_TO_CHUNK(p)))
> +		return p;
> +	if (n >= PAGE_SIZE)
> +		n = mal0_clear(p, PAGE_SIZE, n);
> +	return memset(p, 0, n);
>  }
>  
>  void *realloc(void *p, size_t n)
> --
> 2.11.0

Overall I like this. Reviewing what was discussed on IRC, I called the
loop logic clever, and nsz said maybe a bit too clever. On further
reading I think he's right.

One additional concern was that the reverse scanning may be bad for
performance. I suggested it might work just as well to restructure the
loop as "for each word, if nonzero, memset to end of page and advance
to that point".

A cheap way to avoid the scanning logic for the first and last partial
page, while not complicating the loop logic, would be just writing a
nonzero value to the first byte of each before the loop.

Rich
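[Editorial note: the forward-scanning restructure and sentinel-byte trick suggested above can be sketched as below. This is a hypothetical illustration, not musl's actual code: `mal0_clear_fwd` and `PAGE_SZ` are invented names, the word type and alignment assumptions mirror the patch, and the nonzero byte written to the first byte of the first and last partial pages is exactly the cheap trick described in the reply.]

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical fixed page size for the sketch; musl uses PAGE_SIZE. */
#define PAGE_SZ 4096

/* Forward-scanning variant: walk the buffer one word at a time; on the
 * first nonzero word, memset from there to the end of the containing
 * page and jump ahead to that point.  Writing a nonzero byte at the
 * start of the head page and of the page holding the last byte forces
 * the partial edge pages through the same memset path, so the loop
 * needs no special cases.  Assumes p is word-aligned, as malloc
 * guarantees. */
static void mal0_clear_fwd(char *p, size_t n)
{
	typedef unsigned long long T;
	char *end = p + n;
	/* Start of the page containing the allocation's last byte. */
	char *tail = (char *)((uintptr_t)(end - 1) & -(uintptr_t)PAGE_SZ);
	if (tail < p) tail = p;
	p[0] = 1;   /* force the (partial) head page to be cleared */
	*tail = 1;  /* force the (partial) tail page to be cleared */
	char *q = p;
	while (q + sizeof(T) <= end) {
		if (*(T *)q) {
			/* Nonzero word: clear through the end of this
			 * page, clamped to the end of the buffer. */
			char *pg = (char *)(((uintptr_t)q
			            & -(uintptr_t)PAGE_SZ) + PAGE_SZ);
			if (pg > end) pg = end;
			memset(q, 0, (size_t)(pg - q));
			q = pg;
		} else {
			q += sizeof(T);
		}
	}
	memset(q, 0, (size_t)(end - q));  /* sub-word remainder */
}
```

Unlike the patch, this sketch reads forward (which tends to suit hardware prefetchers) and tests one word per iteration rather than two; it also type-puns the buffer through `char *`, as the patch does, which musl can get away with but is formally outside strict aliasing.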