From mboxrd@z Thu Jan 1 00:00:00 1970
X-Msuck: nntp://news.gmane.org/gmane.linux.lib.musl.general/3786
Path: news.gmane.org!not-for-mail
From: Rich Felker
Newsgroups: gmane.linux.lib.musl.general
Subject: Re: ARM memcpy post-0.9.12-release thread
Date: Wed, 31 Jul 2013 02:13:37 -0400
Message-ID: <20130731061337.GC221@brightrain.aerifal.cx>
References: <20130731022631.GA6655@brightrain.aerifal.cx>
    <20130731051347.7d8340ac@ralda.gmx.de>
    <20130731032315.GA221@brightrain.aerifal.cx>
    <20130731061858.07c30257@ralda.gmx.de>
Reply-To: musl@lists.openwall.com
NNTP-Posting-Host: plane.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Trace: ger.gmane.org 1375251229 18754 80.91.229.3 (31 Jul 2013 06:13:49 GMT)
X-Complaints-To: usenet@ger.gmane.org
NNTP-Posting-Date: Wed, 31 Jul 2013 06:13:49 +0000 (UTC)
To: musl@lists.openwall.com
Original-X-From: musl-return-3790-gllmg-musl=m.gmane.org@lists.openwall.com Wed Jul 31 08:13:51 2013
Return-path:
Envelope-to: gllmg-musl@plane.gmane.org
Original-Received: from mother.openwall.net ([195.42.179.200])
    by plane.gmane.org with smtp (Exim 4.69)
    (envelope-from )
    id 1V4PfT-0005Nw-DH
    for gllmg-musl@plane.gmane.org; Wed, 31 Jul 2013 08:13:51 +0200
Original-Received: (qmail 31776 invoked by uid 550); 31 Jul 2013 06:13:50 -0000
Mailing-List: contact musl-help@lists.openwall.com; run by ezmlm
Precedence: bulk
List-Post:
List-Help:
List-Unsubscribe:
List-Subscribe:
Original-Received: (qmail 31768 invoked from network); 31 Jul 2013 06:13:50 -0000
Content-Disposition: inline
In-Reply-To: <20130731061858.07c30257@ralda.gmx.de>
User-Agent: Mutt/1.5.21 (2010-09-15)
Xref: news.gmane.org gmane.linux.lib.musl.general:3786
Archived-At:

On Wed, Jul 31, 2013 at 06:18:58AM +0200, Harald Becker wrote:
> Hi Rich !
>
> 30-07-2013 23:23 Rich Felker :
>
> > > misaligned case happens mostly due to working with strings,
> > > and those are usually short.
> > > Can't we consider other
> > > misaligned cases violation of the programmer or code
> > > generator? If so, I would prefer the best-attempt inline asm
> > > versions of code or even best attempt C code over arch
> > > specific asm versions ... and add
> >
> > Part of the problem discussed on #musl was that I was having to
> > be really careful with "best attempt C" since GCC will
> > _generate_ calls to memcpy for some code, even when
> > -ffreestanding is used. The folks on #gcc claim this is not a
> > bug. So, if compilers deem themselves at liberty to make this
> > kind of transformation, any C implementation of memcpy that's
> > not intentionally crippled (e.g. using volatile temps and 20x
> > slower than it should be) is a time-bomb that might blow up on
> > us with the next GCC version...
>
> I never deal with the details of this type of gcc code
> generation, but doesn't this only happen on small and structure
> copies? Structure copies which shall usually be aligned? So if
> they are aligned the simpler version saves code space.

I'm sorry, I don't think I was clear. The issue is that GCC
recognizes certain patterns and generates calls to memcpy rather
than doing the work inline. If it does this in memcpy.c, you end up
with a version of memcpy that invokes infinite recursion and is
thereby unusable.

The issue I hit was that GCC was generating memcpy calls for copying
struct { char block[32]; }, which has no alignment requirement. This
technique was probably the best bet at getting the compiler to
generate an efficient memcpy (in fact, it works quite well on some
other archs), but on ARM it blew away the stack.

When looking for a solution, however, I came across this:

http://gcc.gnu.org/bugzilla//show_bug.cgi?id=56888

It looks to me like the situation is that, as compilers get smarter
and smarter, it's going to become increasingly difficult to ensure
that memcpy doesn't get compiled to a call to memcpy.
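To make the struct-copy pattern concrete, here is a minimal sketch of
the kind of code described above. The names block32 and block_copy are
hypothetical and this is not the code from musl; it only illustrates
the technique. Whether a given compiler lowers the struct assignment
to a memcpy call depends on the target, version, and flags, so treat
that part as an assumption to check against the generated asm.

```c
#include <stddef.h>

/* Hypothetical 32-byte block type; it contains only chars, so it has
   no alignment requirement beyond 1. */
struct block32 { char block[32]; };

/* Deliberately NOT named memcpy. If it were, and the compiler lowered
   the struct assignment below to a call to memcpy, the function would
   call itself and recurse until the stack ran out. */
void *block_copy(void *restrict dest, const void *restrict src, size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;

    /* Copy 32 bytes per iteration via struct assignment. */
    for (; n >= sizeof(struct block32); n -= sizeof(struct block32)) {
        *(struct block32 *)d = *(const struct block32 *)s;
        d += sizeof(struct block32);
        s += sizeof(struct block32);
    }

    /* Copy the remaining tail one byte at a time. */
    while (n--) *d++ = *s++;
    return dest;
}
```

Compiling something like this with -S for the target in question and
searching the output for a call to memcpy is one way to see whether
the transformation happened.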

So, my long-term plan (this is still open to discussion) is to do
something like this:

Have one or more C memcpy implementations on-hand that empirically
generate good code. For important archs, have hand-optimized asm;
this is both smaller and better-performing than anything decent we
can achieve with C. For archs where we don't yet have asm, generate
asm from whichever C implementation works best. Then, instead of
having the performance-oriented C in the source tree, have a
fail-safe C version that the compiler can't possibly mess up; this
ensures that future ports can get started without having to worry
about whether the compiler breaks memcpy.

> > This makes asm (either inline or standalone) a lot more
> > appealing for memcpy than it otherwise would be.
>
> Optimization is always a question of decision, which I consider
> the hard part of the job ... :(
>
> > > a warning for performance lose on misaligned data in
> > > documentation, with giving a rough percentage of this lose.
> >
> > You'd prefer video processing being 4 to 5 times slower?
>
> No, definitely not, but video processing is one of the cases I
> consider candidate for optimized processing. So such projects
> shall include an optimize version of low level processing
> functions (including memcpy, but not only - candidate for
> library with optimized functions?).

Are you aware that redefining functions with the standard names
invokes undefined behavior? Yes, you could write your own memcpy by
another name, but then it can't get used by things like stdio (where,
if it's slow, it's likely a large portion of time spent on file I/O),
TLS image copying (per-thread startup cost), etc.

Of all the functions in libc, memcpy is definitely the most
performance-critical to the most applications. The other things that
matter are malloc/free, math (sometimes), stdio, qsort, and
searching/matching functions like regex, strstr, etc.
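For the fail-safe C version in the plan above, one possible shape is
a byte loop through volatile-qualified pointers, along the lines of
the "volatile temps" idea quoted earlier. This is only a sketch under
that assumption; failsafe_copy is a hypothetical name, and in a libc
the function would of course be memcpy itself.

```c
#include <stddef.h>

/* Each volatile access has to be performed individually, so the
   compiler cannot merge the loop into a call to memcpy. The price is
   speed: nothing here can be widened or vectorized. */
void *failsafe_copy(void *restrict dest, const void *restrict src, size_t n)
{
    volatile unsigned char *d = dest;
    const volatile unsigned char *s = src;

    while (n--) *d++ = *s++;
    return dest;
}
```

This is the "intentionally crippled" end of the trade-off: it is only
meant to let a new port boot before real asm exists, not to be fast.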

> > Video typically consists of single-byte samples (planar YUV) and
> > operations like cropping to a non-multiple-of-4 size, motion
> > compensation, etc. all involve misaligned memcpy. Same goes for
> > image transformations in gimp, image blitting in web browsers
> > (not necessarily aligned to multiple-of-4 boundaries unless
> > you're using 32bpp), etc...
>
> You are all right, but the programmer shall know of this and
> consider to use appropriate functions. You can write the code for

The programmer should write asm for 20 different archs? Most people
have better things to do with their time.

Back to the point, musl is not dietlibc. If you want the smallest,
lowest-quality imaginable libc, there's dietlibc you can use. musl's
aim is to be a robust general-purpose libc. "Switch from Bionic to
musl and make your apps run five times slower" is not appealing to
anybody.

If the choice were between having fully general, clean C code that
runs 5-10% slower or giant gobs of asm with a heavy maintenance
burden that runs 5-10% faster, I would probably agree and just figure
the people who really need that last 5-10% can drop in some fancy
asm. But that's not the situation we're in. The current code is half
the speed of a decent (still probably not even the fastest)
implementation for aligned copies and nearly five times slower for
misaligned copies. That's well outside the range of "special
interest" and into the range of "our implementation sucks".

Moreover, the choice here is not between clean C and dirty asm. It's
between dirty C and, well, whatever you think of the asm. The only
"clean" C memcpy is:

    while (n--) *d++ = *s++;

Our C memcpy depends on implementation-defined behavior (casting
pointers to integers to inspect their alignment) as well as undefined
behavior (aliasing violations to copy as size_t units). The latter
cannot be detected by a compiler that's not performing LTO/whole
program optimization, so it's "safe" for the most part, but it's
still wrong.
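To illustrate the two kinds of "dirty" mentioned above, here is a
minimal sketch of a word-at-a-time copy next to the clean byte loop.
It is not the memcpy from the musl tree; naive_copy and word_copy are
hypothetical names, and falling back to bytes when the two pointers
are misaligned relative to each other is a simplification.

```c
#include <stddef.h>
#include <stdint.h>

/* The "clean" version: one byte at a time, nothing questionable. */
void *naive_copy(void *restrict dest, const void *restrict src, size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;

    while (n--) *d++ = *s++;
    return dest;
}

/* The "dirty" version: copies in size_t units when both pointers
   share the same alignment offset. */
void *word_copy(void *restrict dest, const void *restrict src, size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;

    /* Implementation-defined: converting pointers to integers in
       order to inspect their alignment. */
    if ((uintptr_t)d % sizeof(size_t) == (uintptr_t)s % sizeof(size_t)) {
        /* Byte copy until d (and therefore s) is word-aligned. */
        for (; n && (uintptr_t)d % sizeof(size_t); n--) *d++ = *s++;

        /* Undefined behavior: the buffers are accessed through size_t
           lvalues regardless of their real type (aliasing violation). */
        size_t *wd = (size_t *)d;
        const size_t *ws = (const size_t *)s;
        for (; n >= sizeof(size_t); n -= sizeof(size_t)) *wd++ = *ws++;

        d = (unsigned char *)wd;
        s = (const unsigned char *)ws;
    }

    /* Tail bytes, or the whole copy in the misaligned case. */
    while (n--) *d++ = *s++;
    return dest;
}
```

The misaligned path here is just the byte loop, which is the case
where a C version like this loses the most ground to asm.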
So from a standpoint of clean code, getting decent asm on all the
archs and then possibly replacing the C with something more naive
would probably be a step forward.

Rich