From mboxrd@z Thu Jan 1 00:00:00 1970
X-Msuck: nntp://news.gmane.org/gmane.linux.lib.musl.general/6967
Path: news.gmane.org!not-for-mail
From: Rich Felker
Newsgroups: gmane.linux.lib.musl.general
Subject: Re: Re: [PATCH] x86_64/memset: simple optimizations
Date: Tue, 10 Feb 2015 15:43:42 -0500
Message-ID: <20150210204342.GJ23507@brightrain.aerifal.cx>
References: <1423258814-9045-1-git-send-email-vda.linux@googlemail.com>
 <20150207003535.GS23507@brightrain.aerifal.cx>
 <20150207130655.GW23507@brightrain.aerifal.cx>
Reply-To: musl@lists.openwall.com
NNTP-Posting-Host: plane.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Trace: ger.gmane.org 1423601038 12940 80.91.229.3 (10 Feb 2015 20:43:58 GMT)
X-Complaints-To: usenet@ger.gmane.org
NNTP-Posting-Date: Tue, 10 Feb 2015 20:43:58 +0000 (UTC)
To: musl@lists.openwall.com
Original-X-From: musl-return-6980-gllmg-musl=m.gmane.org@lists.openwall.com Tue Feb 10 21:43:58 2015
Return-path:
Envelope-to: gllmg-musl@m.gmane.org
Original-Received: from mother.openwall.net ([195.42.179.200])
 by plane.gmane.org with smtp (Exim 4.69) (envelope-from )
 id 1YLHf2-0003px-Pt for gllmg-musl@m.gmane.org;
 Tue, 10 Feb 2015 21:43:56 +0100
Original-Received: (qmail 4039 invoked by uid 550); 10 Feb 2015 20:43:55 -0000
Mailing-List: contact musl-help@lists.openwall.com; run by ezmlm
Precedence: bulk
List-Post:
List-Help:
List-Unsubscribe:
List-Subscribe:
Original-Received: (qmail 4025 invoked from network); 10 Feb 2015 20:43:54 -0000
Content-Disposition: inline
In-Reply-To:
User-Agent: Mutt/1.5.21 (2010-09-15)
Xref: news.gmane.org gmane.linux.lib.musl.general:6967
Archived-At:

On Tue, Feb 10, 2015 at 09:27:17PM +0100, Denys Vlasenko wrote:
> On Sat, Feb 7, 2015 at 2:06 PM, Rich Felker wrote:
> > On Sat, Feb 07, 2015 at 01:49:43PM +0100, Denys Vlasenko wrote:
> >> On Sat, Feb 7, 2015 at 1:35 AM, Rich Felker wrote:
> >> What speedups?
> >> In particular:
> >> - perform pre-alignment if dst is unaligned
> >
> > For the rep stosq path? Does it help? I don't recall the details but I
> > seem to remember both docs and measurements showing no reliable
> > benefit from alignment for this instruction, and we had people trying
> > things on several different cpu models. I'm open to hearing evidence
> > to the contrary though.
>
> size:20k buf:0x7f38656e2100
> stos:25978 ns (times 32), 25.227500 bytes/ns
> stos+1:31395 ns (times 32), 20.874662 bytes/ns
> stos+4:31396 ns (times 32), 20.873997 bytes/ns
> stos+8:24446 ns (times 32), 26.808476 bytes/ns
>
> size:50k buf:0x7fbca1dc9100
> stos:68149 ns (times 32), 24.041439 bytes/ns
> stos+1:85762 ns (times 32), 19.104032 bytes/ns
> stos+4:85762 ns (times 32), 19.104032 bytes/ns
> stos+8:68204 ns (times 32), 24.022051 bytes/ns
>
> size:1024k buf:0x7fa3036a5100
> stos:1632285 ns (times 32), 20.556724 bytes/ns
> stos+1:1891092 ns (times 32), 17.743416 bytes/ns
> stos+4:1891089 ns (times 32), 17.743444 bytes/ns
> stos+8:1632181 ns (times 32), 20.558034 bytes/ns
>
> size:5000k buf:0x7fdf5cd6b100
> stos:15592138 ns (times 32), 10.558298 bytes/ns
> stos+1:15501841 ns (times 32), 10.619799 bytes/ns
> stos+4:15507773 ns (times 32), 10.615737 bytes/ns
> stos+8:15589617 ns (times 32), 10.560005 bytes/ns
>
> The source is attached.

OK. This looks sufficiently significant (despite unaligned memsets
being rare) that it would be nice to optimize it. Could we just write
an initial possibly-misaligned word then increment the start address
and round it up before using rep stos?

> #define _GNU_SOURCE
> #include
> #include
> #include
> #include
> #include
> #include
> #include
> #include
> /* Old glibc (< 2.3.4) does not provide this constant. We use syscall
>  * directly so this definition is safe. */
> #ifndef CLOCK_MONOTONIC
> #define CLOCK_MONOTONIC 1
> #endif
>
> /* libc has incredibly messy way of doing this,
>  * typically requiring -lrt. We just skip all this mess */
> static void get_mono(struct timespec *ts)
> {
>     syscall(__NR_clock_gettime, CLOCK_MONOTONIC, ts);
> }

FWIW, this is a bad idea; you get syscall overhead in your
measurements. If you just use clock_gettime (the function) you'll get
vdso results (no syscall). Using the syscall directly is also sketchy
in that x32 has an incorrect kernel-side definition for struct
timespec, but I think it will only matter if aarch64-ILP32 copies this
problem from x32 and you're using a big-endian system.

Rich
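
For illustration, the pre-alignment idea raised above (write one
possibly-misaligned word, then round the start address up before the
bulk fill) might be sketched in portable C roughly as follows. This is
not musl's memset: the names memset_prealigned and
check_memset_prealigned are hypothetical, the loop of aligned 8-byte
stores only stands in for rep stosq, and the extra misaligned word at
the tail is an assumption added so the sketch handles lengths that are
not a multiple of 8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper for illustration only; not musl's memset. */
static void *memset_prealigned(void *dst, int c, size_t n)
{
    unsigned char *p = dst;
    /* Replicate the fill byte into all 8 bytes of a word. */
    uint64_t word = (uint64_t)(unsigned char)c * 0x0101010101010101ULL;

    /* The head/tail trick needs at least one full word. */
    if (n < 8) {
        for (size_t i = 0; i < n; i++)
            p[i] = (unsigned char)c;
        return dst;
    }

    /* One possibly-misaligned word at each end of the region. */
    memcpy(p, &word, 8);
    memcpy(p + n - 8, &word, 8);

    /* Bytes to skip so the bulk fill starts on an 8-byte boundary. */
    size_t head = (size_t)(-(uintptr_t)p & 7);
    /* Whole aligned words that fit after the skipped head. */
    size_t body = (n - head) & ~(size_t)7;
    assert(head + body <= n);

    /* Stand-in for rep stosq: 8-byte stores to aligned addresses. */
    for (size_t i = 0; i < body; i += 8)
        memcpy(p + head + i, &word, 8);

    return dst;
}

/* Compare the sketch against memset for a range of offsets and
 * lengths; returns the number of mismatching cases. */
static int check_memset_prealigned(void)
{
    unsigned char got[96], want[96];
    int bad = 0;

    for (size_t off = 0; off < 16; off++) {
        for (size_t n = 0; n <= 64; n++) {
            memset(got, 0xAA, sizeof got);
            memset(want, 0xAA, sizeof want);
            memset_prealigned(got + off, 0x5C, n);
            memset(want + off, 0x5C, n);
            bad += memcmp(got, want, sizeof got) != 0;
        }
    }
    return bad;
}
```

The head and tail words overlap the aligned body by up to 7 bytes each,
which is harmless since every byte gets the same value; that overlap is
what removes the need for byte-at-a-time loops at either end. Whether
this actually recovers the stos+1/stos+4 slowdown shown in the numbers
above would still need to be measured.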