From mboxrd@z Thu Jan 1 00:00:00 1970 X-Msuck: nntp://news.gmane.org/gmane.linux.lib.musl.general/7101 Path: news.gmane.org!not-for-mail From: Rich Felker Newsgroups: gmane.linux.lib.musl.general Subject: Re: x86[_64] memset and rep stos Date: Wed, 25 Feb 2015 12:26:20 -0500 Message-ID: <20150225172620.GA1414@brightrain.aerifal.cx> References: <20150225061204.GA25485@brightrain.aerifal.cx> <20150225092020.GW32724@port70.net> Reply-To: musl@lists.openwall.com NNTP-Posting-Host: plane.gmane.org Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Trace: ger.gmane.org 1424885228 11844 80.91.229.3 (25 Feb 2015 17:27:08 GMT) X-Complaints-To: usenet@ger.gmane.org NNTP-Posting-Date: Wed, 25 Feb 2015 17:27:08 +0000 (UTC) To: musl@lists.openwall.com Original-X-From: musl-return-7114-gllmg-musl=m.gmane.org@lists.openwall.com Wed Feb 25 18:27:03 2015 Return-path: Envelope-to: gllmg-musl@m.gmane.org Original-Received: from mother.openwall.net ([195.42.179.200]) by plane.gmane.org with smtp (Exim 4.69) (envelope-from ) id 1YQfjj-0005VS-41 for gllmg-musl@m.gmane.org; Wed, 25 Feb 2015 18:27:03 +0100 Original-Received: (qmail 22346 invoked by uid 550); 25 Feb 2015 17:27:00 -0000 Mailing-List: contact musl-help@lists.openwall.com; run by ezmlm Precedence: bulk List-Post: List-Help: List-Unsubscribe: List-Subscribe: Original-Received: (qmail 22161 invoked from network); 25 Feb 2015 17:26:32 -0000 Content-Disposition: inline In-Reply-To: <20150225092020.GW32724@port70.net> User-Agent: Mutt/1.5.21 (2010-09-15) Original-Sender: Rich Felker Xref: news.gmane.org gmane.linux.lib.musl.general:7101 Archived-At: On Wed, Feb 25, 2015 at 10:20:20AM +0100, Szabolcs Nagy wrote: > * ?????? [2015-02-25 15:54:31 +0800]: > > I'm not an expert on micro optimization, but why not use a dynamic > > routine selection system which would select the optimal routine for a > > given CPU during program initialization. The routine selection > > algorithm could simply be a predefined static table look up. > > IMO, only very small number of functions (like memset, memcpy) would > > benefit from such a system, so no code size overhead to worry about. > > my guess is > > - for static linking it adds at least an extra indirection > and these functions often used with small input Yes. Actually the extra indirection seems not to be measurable on modern cpus but it may matter on older cpus, and older cpus are the only reason to want runtime selection since the optimal strategy is not changing for modern ones. > - code size overhead: now you have to include all possible > versions in libc.so Or worse, in a static binary. > - for dynamic linking there is a load time dispatch mechanism: > STT_GNU_IFUNC but it's broken due to lack of specs It also wouldn't get resolved before libc.so/ld.so needed to call these functions. > - maintainance burden, hard to test This is the main one -- unbounded growth of maintenance and debugging burden -- how would you test N different versions of the same code without either having a test system for each cpu model, or extra hacks to force a particular version of the code to get used even when it mismatches the host cpu model? Who is responsible for ensuring that the "optimized" variants for old cpus are actually still faster than all the new variants? Etc. These issues are not just theoretical; they're a real world problem that's plagued glibc even though they have a much bigger team maintaining it. And uclibc basically fell apart (IMO) because of inability to test and maintain all the combinations of options (although it was other types of options, not optimized asm variants). > - selecting the right algorithm at runtime is not easy Well, it involves the same tedious (but sometimes fun) work Denys and I have been doing, except doing it with lots of different cpu models and code variants (i.e. much more work). > but i guess eventually when more ppl use musl it will make sense > to add more target specific optimizations Target-specific, yes, at least for targets that are being used in places where there are measurable bottlenecks. But I'd still rather avoid code variants for cpu models within the same ISA. If it's looking like that kind of thing is needed, perhaps we should try to model the relevant functions with "GNU C" (with portable fallback of course, as always) and have the compiler generate the appropriate variants (i.e. outsourcing trust that the generated code is actually correct to the compiler) rather than writing multiple asm versions. Rich