From: Andre Renaud
Newsgroups: gmane.linux.lib.musl.general
Subject: Re: Thinking about release
Date: Wed, 10 Jul 2013 10:26:46 +1200
To: musl@lists.openwall.com
Reply-To: musl@lists.openwall.com
References: <20130613012517.GA5859@brightrain.aerifal.cx> <20130613014314.GC29800@brightrain.aerifal.cx> <20130709053711.GO29800@brightrain.aerifal.cx>
Content-Type: text/plain; charset=UTF-8

Replying to myself

> Certainly, if there were a more straightforward C implementation that
> achieved similar results, that would be superior. However, the existing
> musl C memcpy code is already optimised to some degree (doing 32-bit
> rather than 8-bit copies), and I've found it difficult to convince gcc
> to use the load-multiple and store-multiple instructions from C without
> resorting to pretty horrible code. It may still be preferable to the
> assembler, though. At this stage I haven't benchmarked this - I'll see
> if I can come up with something.

As a comparison, the existing memcpy.c implementation copies
sizeof(size_t) bytes at a time, which on ARM is 4 bytes. This ends up
being compiled to ordinary single-word load/store instructions. However,
GCC is smart enough to use ldm/stm (load-multiple/store-multiple)
instructions when copying structures larger than 4 bytes.
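As a minimal illustration of that (the struct and function names below
are just made up for this sketch, they aren't from musl), a plain
structure assignment is usually enough to get GCC to emit the
multi-register instructions on ARM at -O2:

#include <stddef.h>

/* A 16-byte copy unit; four words on 32-bit ARM. */
struct copy_unit { size_t d[4]; };

void copy_one_unit(struct copy_unit *dst, const struct copy_unit *src)
{
        /* GCC typically turns this assignment into an ldm/stm pair
         * rather than four separate ldr/str pairs. */
        *dst = *src;
}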
So if we change memcpy.c to use a structure whose size is > 4 (i.e. 16)
instead of size_t as its basic copy unit, we do see some improvement:

#include <stddef.h>
#include <stdint.h>

typedef struct multiple_size_t {
        size_t d[4];
} multiple_size_t;

#define SS (sizeof(multiple_size_t))
#define ALIGN (sizeof(multiple_size_t)-1)

void *my_memcpy(void * restrict dest, const void * restrict src, size_t n)
{
        unsigned char *d = dest;
        const unsigned char *s = src;

        /* If the two pointers can never reach mutual alignment,
         * copy byte-by-byte. */
        if (((uintptr_t)d & ALIGN) != ((uintptr_t)s & ALIGN))
                goto misaligned;

        /* Copy bytes until the destination is aligned to SS. */
        for (; ((uintptr_t)d & ALIGN) && n; n--) *d++ = *s++;

        if (n) {
                /* Copy SS bytes at a time via structure assignment. */
                multiple_size_t *wd = (void *)d;
                const multiple_size_t *ws = (const void *)s;
                for (; n>=SS; n-=SS) *wd++ = *ws++;
                d = (void *)wd;
                s = (const void *)ws;
misaligned:
                /* Copy any remaining bytes one at a time. */
                for (; n; n--) *d++ = *s++;
        }
        return dest;
}

This results in 95MB/s on my platform (up from 65MB/s for the existing
memcpy.c, and down from 105MB/s with the asm-optimised version), and it
is essentially as readable as the existing memcpy.c.

I'm not really familiar with any other CPU architectures, so I'm not
sure whether this would improve or hurt performance on other platforms.
Any comments on using something like this for memcpy instead? Obviously
it carries a higher penalty when the size of the area to be copied is
between sizeof(size_t) and sizeof(multiple_size_t).

Regards,
Andre