From mboxrd@z Thu Jan 1 00:00:00 1970
From: Rich Felker
Newsgroups: gmane.linux.lib.musl.general
Subject: Optimized C memset
Date: Tue, 27 Aug 2013 04:30:20 -0400
Message-ID: <20130827083020.GA4503@brightrain.aerifal.cx>
Reply-To: musl@lists.openwall.com
To: musl@lists.openwall.com
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="PNTmBPCT7hxwcZjr"
Content-Disposition: inline
User-Agent: Mutt/1.5.21 (2010-09-15)

--PNTmBPCT7hxwcZjr
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

I'm sending this to the list before committing it just to get some
comments/feedback.

The key feature of this memset, much like the x86 asm, is that it
writes from both ends in a possibly-overlapping manner to minimize the
number of branches. Unlike in the asm, though, I've also used the
write-from-both-ends logic to allow trivial alignment handling.

One aspect of this code that may appear ugly at first is the usage of
the __GNUC__ macro.
I've been bothered for a long time by the aliasing violations in
src/string/*.c which are only "safe" insomuch as the compiler cannot
see across extern function calls. The purpose of checking for __GNUC__
and using the may_alias attribute is to document to the compiler that
aliasing is taking place in a controlled manner. If we don't have a
compiler that accepts this attribute, the code falls back to using a
naive loop with no aliasing violations. The prologue code, including
alignment, is still kept, so that optimizing compilers can tell that
the pointer is aligned when the naive loop is reached, possibly
optimizing it back into something fast.

(In fact, with -msse, gcc is able to make the naive version nearly
twice as fast as the fancy C version, but unfortunately it's unable to
do any gp-register based vectorization for non-SIMD targets. At some
point we may want to add an override to turn off the fancy C code and
let the compiler do all the work...)

So, I'd like to consider gradually transitioning all of the string
code that breaks the aliasing rules over to using an approach like
this. Any thoughts on this? I hope it's not too ugly, but I don't know
any other way that improves correctness and maintains or improves
performance.

By the way, this new code obsoletes the memset asm for i386 and x86_64
that was added during this release cycle, so I guess I should just
delete the asm. I tried some simple improvements to the asm to make it
faster, but couldn't come close to beating the new C code.
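
To make the write-from-both-ends idea and the byte replication used for
the word-sized stores concrete, here is a minimal sketch that is
separate from the attached memset5.c. The names fill_small and splat32
are made up for illustration, and fill_small assumes n is at most 8; it
only mirrors the byte-wise prologue, not the aligned word stores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, for illustration only: fill n bytes (n <= 8)
 * by storing from both ends. Head and tail stores may hit the same
 * byte, which is harmless because every store writes the same value. */
static void *fill_small(void *dest, int c, size_t n)
{
	unsigned char *s = dest;

	assert(n <= 8);           /* assumption of this sketch */
	if (!n) return dest;
	s[0] = s[n-1] = c;        /* covers n == 1 and n == 2 */
	if (n <= 2) return dest;
	s[1] = s[n-2] = c;
	s[2] = s[n-3] = c;        /* covers n == 3 through n == 6 */
	if (n <= 6) return dest;
	s[3] = s[n-4] = c;        /* covers n == 7 and n == 8 */
	return dest;
}

/* Hypothetical helper: replicate one byte into all four bytes of a
 * 32-bit word. 0xffffffff / 255 is 0x01010101, and multiplying that
 * by b places a copy of b in each byte. */
static uint32_t splat32(unsigned char b)
{
	return (uint32_t)-1 / 255 * b;
}
```

Three branches handle every length from 0 to 8 here, where a plain
byte loop would need one branch per byte.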
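
Rich

--PNTmBPCT7hxwcZjr
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="memset5.c"

#include <string.h>
#include <stdint.h>

#if 100*__GNUC__+__GNUC_MINOR__ >= 302
#define may_alias __attribute__((__may_alias__))
#else
#define may_alias
#endif

typedef uint32_t may_alias u32;
typedef uint64_t may_alias u64;

void *memset(void *dest, int c, size_t n)
{
	unsigned char *s = dest;
	u32 c32;
	u64 c64;
	size_t k;

	/* Fill head and tail bytes with minimal branching. Each
	 * conditional guarantees that the offsets used after it
	 * are inside the destination. */
	if (!n) return dest;
	s[0] = s[n-1] = c;
	if (n <= 2) return dest;
	s[1] = s[n-2] = c;
	s[2] = s[n-3] = c;
	if (n <= 6) return dest;
	s[3] = s[n-4] = c;
	if (n <= 8) return dest;

	/* Advance s to a 4-byte boundary and truncate n to a
	 * multiple of 4. The byte stores above already cover
	 * whatever head or tail gets cut off. */
	k = -(uintptr_t)s & 3;
	s += k;
	n -= k;
	n &= -4;

#ifdef __GNUC__
	c32 = ((u32)-1)/255 * (unsigned char)c;

	/* Fill up to 28 bytes of head and tail with 32-bit stores,
	 * since the 64-bit loop below can leave up to 28 tail
	 * bytes unwritten. */
	*(u32 *)(s+0) = c32;
	*(u32 *)(s+n-4) = c32;
	if (n <= 8) return dest;
	*(u32 *)(s+4) = c32;
	*(u32 *)(s+8) = c32;
	*(u32 *)(s+n-12) = c32;
	*(u32 *)(s+n-8) = c32;
	if (n <= 24) return dest;
	*(u32 *)(s+12) = c32;
	*(u32 *)(s+16) = c32;
	*(u32 *)(s+20) = c32;
	*(u32 *)(s+24) = c32;
	*(u32 *)(s+n-28) = c32;
	*(u32 *)(s+n-24) = c32;
	*(u32 *)(s+n-20) = c32;
	*(u32 *)(s+n-16) = c32;

	/* Skip the filled head and land on an 8-byte boundary:
	 * k is 24 if s is already 8-byte aligned, 28 otherwise. */
	k = 24 + ((uintptr_t)s & 4);
	s += k;
	n -= k;

	/* Any remainder once n drops below 32 is at most 28 bytes
	 * and was already written by the tail stores above. */
	c64 = c32 | ((u64)c32 << 32);
	for (; n >= 32; n-=32, s+=32) {
		*(u64 *)(s+0) = c64;
		*(u64 *)(s+8) = c64;
		*(u64 *)(s+16) = c64;
		*(u64 *)(s+24) = c64;
	}
#else
	/* Naive fallback with no aliasing violations. */
	for (; n; n--, s++) *s = c;
#endif

	return dest;
}

--PNTmBPCT7hxwcZjr--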