From: Rob Landley
Newsgroups: gmane.linux.lib.musl.general
Subject: Re: Re: [J-core] Aligned copies and cacheline conflicts?
Date: Fri, 16 Sep 2016 20:40:05 -0500
Reply-To: musl@lists.openwall.com
To: Rich Felker
Cc: "j-core@j-core.org", musl@lists.openwall.com
In-Reply-To: <20160916221603.GS15995@brightrain.aerifal.cx>

On 09/16/2016 05:16 PM, Rich Felker wrote:
> Attached is a draft memcpy I'm considering for musl. Compared to the
> current one, it:
>
> 1. Works on 32 bytes per iteration, and adds barriers between the load
> phase and store phase to preclude cache line aliasing between src
> and dest with a direct-mapped cache.
>
> 2. Equally unrolls the misaligned src/dest cases.
>
> 3. Adjusts the offsets used in the misaligned src/dest loops to all be
> multiples of 4, with the adjustments to make that work outside the
> loops. This helps compilers generate indexed addressing modes (e.g.
> @(4,Rm)) rather than having to resort to arithmetic.
>
> 4. Factors the misaligned cases into a common inline function to
> reduce code duplication.
>
> Comments welcome.

Superficial comments first:

I know the compiler's probably smart enough to convert %4 into &3, but
given that the point is performance optimization, I'd have thought you'd
be explicit about what the machine should be doing?

Both chunks of code have their own 8-register read and 8-register write
(one uses registers 0-7, the other 1-8).

Design comments:

Instead of optimized per-target assembly, you have an #ifdef __GNUC__
wrapped around just under 70 lines of C code with an __asm__ __volatile__
blob in the middle, calling a 20-line C function. Presumably on sh this
produces roughly the same workaround for the primitive cache
architecture, only now you're doing it indirectly and applying it to
everybody.

The motivation for this is that j2 has a more primitive cache
architecture than is normal these days, so it needs an optimization most
other chips don't. But this code is "generic", so it'll also be built on
register-constrained 32-bit x86, and on 64-bit systems where it should
presumably be using u64 rather than u32. And of course gcc inlines its
own memcpy unless you hit it with a brick anyway, so solving this in
musl is of questionable utility.

I'm not sure you're focusing on the right problem?

> Rich

Rob
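For reference, a rough sketch of the load-phase/store-phase idea from
point 1, aligned case only: pull a whole 32-byte block into locals, drop
a compiler barrier, then write it out, so the stores to dest can't evict
the src line from a direct-mapped cache in the middle of a block. This
is only an illustration assuming GNU C (an empty __asm__ with a "memory"
clobber as the barrier); the function name and structure here are
invented for the example, not taken from the draft attached to Rich's
mail.

#include <stddef.h>
#include <stdint.h>

/* Sketch: aligned case, 32 bytes per iteration.  All 8 loads complete
   before the barrier; all 8 stores happen after it, so the compiler
   can't interleave them. */
static void *copy_aligned32(void *restrict dest, const void *restrict src,
                            size_t n)
{
	uint32_t *d = dest;
	const uint32_t *s = src;

	while (n >= 32) {
		uint32_t w0 = s[0], w1 = s[1], w2 = s[2], w3 = s[3];
		uint32_t w4 = s[4], w5 = s[5], w6 = s[6], w7 = s[7];

		/* compiler barrier: keep the loads ahead of the stores */
		__asm__ __volatile__ ("" : : : "memory");

		d[0] = w0; d[1] = w1; d[2] = w2; d[3] = w3;
		d[4] = w4; d[5] = w5; d[6] = w6; d[7] = w7;

		s += 8; d += 8; n -= 32;
	}

	/* byte-at-a-time tail */
	{
		unsigned char *dc = (unsigned char *)d;
		const unsigned char *sc = (const unsigned char *)s;
		while (n--) *dc++ = *sc++;
	}
	return dest;
}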