Development discussion of WireGuard
* Re: [WireGuard] [PATCH] poly1305: generic C can be faster on chips with slow unaligned access
       [not found]     ` <20161103.130852.1456848512897088071.davem@davemloft.net>
@ 2016-11-03 22:20       ` Jason A. Donenfeld
  2016-11-04 17:37         ` Eric Biggers
  0 siblings, 1 reply; 8+ messages in thread
From: Jason A. Donenfeld @ 2016-11-03 22:20 UTC (permalink / raw)
  To: David Miller
  Cc: Herbert Xu, Martin Willi, LKML, linux-crypto, WireGuard mailing list

Hi David,

On Thu, Nov 3, 2016 at 6:08 PM, David Miller <davem@davemloft.net> wrote:
> In any event no piece of code should be doing 32-bit word reads from
> addresses like "x + 3" without, at a very minimum, going through the
> kernel unaligned access handlers.

Excellent point. In other words,

    ctx->r[0] = (le32_to_cpuvp(key +  0) >> 0) & 0x3ffffff;
    ctx->r[1] = (le32_to_cpuvp(key +  3) >> 2) & 0x3ffff03;
    ctx->r[2] = (le32_to_cpuvp(key +  6) >> 4) & 0x3ffc0ff;
    ctx->r[3] = (le32_to_cpuvp(key +  9) >> 6) & 0x3f03fff;
    ctx->r[4] = (le32_to_cpuvp(key + 12) >> 8) & 0x00fffff;

should change to:

    ctx->r[0] = (le32_to_cpuvp(key +  0) >> 0) & 0x3ffffff;
    ctx->r[1] = (get_unaligned_le32(key +  3) >> 2) & 0x3ffff03;
    ctx->r[2] = (get_unaligned_le32(key +  6) >> 4) & 0x3ffc0ff;
    ctx->r[3] = (get_unaligned_le32(key +  9) >> 6) & 0x3f03fff;
    ctx->r[4] = (le32_to_cpuvp(key + 12) >> 8) & 0x00fffff;

> We know explicitly that these offsets will not be 32-bit aligned, so
> it is required that we use the helpers, or alternatively do things to
> avoid these unaligned accesses such as using temporary storage when
> the HAVE_EFFICIENT_UNALIGNED_ACCESS kconfig value is not set.

So the question is: is the clever avoidance of unaligned accesses in
the original patch faster or slower than switching those accesses to
the helper functions?

I've put a little test harness together for playing with this:

    $ git clone git://git.zx2c4.com/polybench
    $ cd polybench
    $ make run

To test the first method, build and run as-is. To test the other,
remove "#define USE_FIRST_METHOD" from the source code.

@René: do you think you could retest on your MIPS32r2 hardware and
report back which is faster?

And if anybody else has other hardware and would like to try, that
would be helpful too.

Regards,
Jason


* Re: [WireGuard] [PATCH] poly1305: generic C can be faster on chips with slow unaligned access
  2016-11-03 22:20       ` [WireGuard] [PATCH] poly1305: generic C can be faster on chips with slow unaligned access Jason A. Donenfeld
@ 2016-11-04 17:37         ` Eric Biggers
  2016-11-07 18:08           ` Jason A. Donenfeld
  0 siblings, 1 reply; 8+ messages in thread
From: Eric Biggers @ 2016-11-04 17:37 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Herbert Xu, Martin Willi, LKML, linux-crypto, David Miller,
	WireGuard mailing list

On Thu, Nov 03, 2016 at 11:20:08PM +0100, Jason A. Donenfeld wrote:
> Hi David,
> 
> On Thu, Nov 3, 2016 at 6:08 PM, David Miller <davem@davemloft.net> wrote:
> > In any event no piece of code should be doing 32-bit word reads from
> > addresses like "x + 3" without, at a very minimum, going through the
> > kernel unaligned access handlers.
> 
> Excellent point. In other words,
> 
>     ctx->r[0] = (le32_to_cpuvp(key +  0) >> 0) & 0x3ffffff;
>     ctx->r[1] = (le32_to_cpuvp(key +  3) >> 2) & 0x3ffff03;
>     ctx->r[2] = (le32_to_cpuvp(key +  6) >> 4) & 0x3ffc0ff;
>     ctx->r[3] = (le32_to_cpuvp(key +  9) >> 6) & 0x3f03fff;
>     ctx->r[4] = (le32_to_cpuvp(key + 12) >> 8) & 0x00fffff;
> 
> should change to:
> 
>     ctx->r[0] = (le32_to_cpuvp(key +  0) >> 0) & 0x3ffffff;
>     ctx->r[1] = (get_unaligned_le32(key +  3) >> 2) & 0x3ffff03;
>     ctx->r[2] = (get_unaligned_le32(key +  6) >> 4) & 0x3ffc0ff;
>     ctx->r[3] = (get_unaligned_le32(key +  9) >> 6) & 0x3f03fff;
>     ctx->r[4] = (le32_to_cpuvp(key + 12) >> 8) & 0x00fffff;
> 

I agree, and the current code is wrong; but do note that this proposal is
correct for poly1305_setrkey() but not for poly1305_setskey() and
poly1305_blocks().  In the latter two cases, 4-byte alignment of the source
buffer is *not* guaranteed.  Although crypto_poly1305_update() will be called
with a 4-byte aligned buffer due to the alignmask set on poly1305_alg, the
algorithm operates on 16-byte blocks and therefore has to buffer partial blocks.
If some number of bytes that is not 0 mod 4 is buffered, then the buffer will
fall out of alignment on the next update call.  Hence, get_unaligned_le32() is
actually needed on all the loads, since the buffer will, in general, be of
unknown alignment.

Note: some other shash algorithms have this problem too and do not handle it
correctly.  It seems to be a common mistake.

Eric


* Re: [WireGuard] [PATCH] poly1305: generic C can be faster on chips with slow unaligned access
  2016-11-04 17:37         ` Eric Biggers
@ 2016-11-07 18:08           ` Jason A. Donenfeld
  2016-11-07 18:23             ` Jason A. Donenfeld
  2016-11-07 18:26             ` Eric Biggers
  0 siblings, 2 replies; 8+ messages in thread
From: Jason A. Donenfeld @ 2016-11-07 18:08 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Herbert Xu, Martin Willi, LKML, linux-crypto, David Miller,
	WireGuard mailing list

Hi Eric,

On Fri, Nov 4, 2016 at 6:37 PM, Eric Biggers <ebiggers@google.com> wrote:
> I agree, and the current code is wrong; but do note that this proposal is
> correct for poly1305_setrkey() but not for poly1305_setskey() and
> poly1305_blocks().  In the latter two cases, 4-byte alignment of the source
> buffer is *not* guaranteed.  Although crypto_poly1305_update() will be called
> with a 4-byte aligned buffer due to the alignmask set on poly1305_alg, the
> algorithm operates on 16-byte blocks and therefore has to buffer partial blocks.
> If some number of bytes that is not 0 mod 4 is buffered, then the buffer will
> fall out of alignment on the next update call.  Hence, get_unaligned_le32() is
> actually needed on all the loads, since the buffer will, in general, be of
> unknown alignment.

Hmm... The general data flow that strikes me as most pertinent is
something like:

struct sk_buff *skb = get_it_from_somewhere();
skb = skb_share_check(skb, GFP_ATOMIC);
num_frags = skb_cow_data(skb, ..., ...);
struct scatterlist sg[num_frags];
sg_init_table(sg, num_frags);
skb_to_sgvec(skb, sg, ..., ...);
blkcipher_walk_init(&walk, sg, sg, len);
blkcipher_walk_virt_block(&desc, &walk, BLOCK_SIZE);
while (walk.nbytes >= BLOCK_SIZE) {
    size_t chunk_len = rounddown(walk.nbytes, BLOCK_SIZE);
    poly1305_update(&poly1305_state, walk.src.virt.addr, chunk_len);
    blkcipher_walk_done(&desc, &walk, walk.nbytes % BLOCK_SIZE);
}
if (walk.nbytes) {
    poly1305_update(&poly1305_state, walk.src.virt.addr, walk.nbytes);
    blkcipher_walk_done(&desc, &walk, 0);
}

Is your suggestion that in the final if block, walk.src.virt.addr
might be unaligned? Like in the case of the last fragment being 67
bytes long?

If so, what a hassle. I hope the performance overhead isn't too
awful... I'll resubmit taking into account your suggestions.

By the way -- offlist benchmarks sent to me concluded that using the
unaligned load helpers, as David suggested, is just as fast as the
handrolled bit magic in the v1.

Regards,
Jason


* Re: [WireGuard] [PATCH] poly1305: generic C can be faster on chips with slow unaligned access
  2016-11-07 18:08           ` Jason A. Donenfeld
@ 2016-11-07 18:23             ` Jason A. Donenfeld
  2016-11-07 18:26             ` Eric Biggers
  1 sibling, 0 replies; 8+ messages in thread
From: Jason A. Donenfeld @ 2016-11-07 18:23 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Herbert Xu, Martin Willi, LKML, linux-crypto, David Miller,
	WireGuard mailing list

On Mon, Nov 7, 2016 at 7:08 PM, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> Hmm... The general data flow that strikes me as most pertinent is
> something like:
>
> struct sk_buff *skb = get_it_from_somewhere();
> skb = skb_share_check(skb, GFP_ATOMIC);
> num_frags = skb_cow_data(skb, ..., ...);
> struct scatterlist sg[num_frags];
> sg_init_table(sg, num_frags);
> skb_to_sgvec(skb, sg, ..., ...);
> blkcipher_walk_init(&walk, sg, sg, len);
> blkcipher_walk_virt_block(&desc, &walk, BLOCK_SIZE);
> while (walk.nbytes >= BLOCK_SIZE) {
>     size_t chunk_len = rounddown(walk.nbytes, BLOCK_SIZE);
>     poly1305_update(&poly1305_state, walk.src.virt.addr, chunk_len);
>     blkcipher_walk_done(&desc, &walk, walk.nbytes % BLOCK_SIZE);
> }
> if (walk.nbytes) {
>     poly1305_update(&poly1305_state, walk.src.virt.addr, walk.nbytes);
>     blkcipher_walk_done(&desc, &walk, 0);
> }
>
> Is your suggestion that in the final if block, walk.src.virt.addr
> might be unaligned? Like in the case of the last fragment being 67
> bytes long?

In fact, I'm not so sure this happens here. In the while loop, each
new walk.src.virt.addr will be aligned to BLOCK_SIZE or be aligned by
virtue of being at the start of a new page. In the subsequent if
block, walk.src.virt.addr will either be
some_aligned_address+BLOCK_SIZE, which will be aligned, or it will be
the start of a new page, which will be aligned.

So what did you have in mind exactly?

I don't think anybody is running code like:

for (size_t i = 0; i < len; i += 17)
    poly1305_update(&poly, &buffer[i], 17);

(And if so, those consumers should be fixed.)


* Re: [WireGuard] [PATCH] poly1305: generic C can be faster on chips with slow unaligned access
  2016-11-07 18:08           ` Jason A. Donenfeld
  2016-11-07 18:23             ` Jason A. Donenfeld
@ 2016-11-07 18:26             ` Eric Biggers
  2016-11-07 19:02               ` Jason A. Donenfeld
  1 sibling, 1 reply; 8+ messages in thread
From: Eric Biggers @ 2016-11-07 18:26 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Herbert Xu, Martin Willi, LKML, linux-crypto, David Miller,
	WireGuard mailing list

On Mon, Nov 07, 2016 at 07:08:22PM +0100, Jason A. Donenfeld wrote:
> Hmm... The general data flow that strikes me as most pertinent is
> something like:
> 
> struct sk_buff *skb = get_it_from_somewhere();
> skb = skb_share_check(skb, GFP_ATOMIC);
> num_frags = skb_cow_data(skb, ..., ...);
> struct scatterlist sg[num_frags];
> sg_init_table(sg, num_frags);
> skb_to_sgvec(skb, sg, ..., ...);
> blkcipher_walk_init(&walk, sg, sg, len);
> blkcipher_walk_virt_block(&desc, &walk, BLOCK_SIZE);
> while (walk.nbytes >= BLOCK_SIZE) {
>     size_t chunk_len = rounddown(walk.nbytes, BLOCK_SIZE);
>     poly1305_update(&poly1305_state, walk.src.virt.addr, chunk_len);
>     blkcipher_walk_done(&desc, &walk, walk.nbytes % BLOCK_SIZE);
> }
> if (walk.nbytes) {
>     poly1305_update(&poly1305_state, walk.src.virt.addr, walk.nbytes);
>     blkcipher_walk_done(&desc, &walk, 0);
> }
> 
> Is your suggestion that in the final if block, walk.src.virt.addr
> might be unaligned? Like in the case of the last fragment being 67
> bytes long?

I was not referring to any users in particular, only what users could do.  As an
example, if you did crypto_shash_update() with 32, 15, then 17 bytes, and the
underlying algorithm is poly1305-generic, the last block would end up
misaligned.  This doesn't appear possible with your pseudocode because it only
passes in multiples of the block size until the very end.  However I don't see
it claimed anywhere that shash API users have to do that.

Eric


* Re: [WireGuard] [PATCH] poly1305: generic C can be faster on chips with slow unaligned access
  2016-11-07 18:26             ` Eric Biggers
@ 2016-11-07 19:02               ` Jason A. Donenfeld
  2016-11-07 19:25                 ` Eric Biggers
  0 siblings, 1 reply; 8+ messages in thread
From: Jason A. Donenfeld @ 2016-11-07 19:02 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Herbert Xu, Martin Willi, LKML, linux-crypto, David Miller,
	WireGuard mailing list

On Mon, Nov 7, 2016 at 7:26 PM, Eric Biggers <ebiggers@google.com> wrote:
>
> I was not referring to any users in particular, only what users could do.  As an
> example, if you did crypto_shash_update() with 32, 15, then 17 bytes, and the
> underlying algorithm is poly1305-generic, the last block would end up
> misaligned.  This doesn't appear possible with your pseudocode because it only
> passes in multiples of the block size until the very end.  However I don't see
> it claimed anywhere that shash API users have to do that.

Actually it appears that crypto/poly1305_generic.c already buffers
incoming blocks to a buffer that definitely looks aligned, to prevent
this condition!

I'll submit a v2 with only the inner unaligned operations changed.


* Re: [WireGuard] [PATCH] poly1305: generic C can be faster on chips with slow unaligned access
  2016-11-07 19:02               ` Jason A. Donenfeld
@ 2016-11-07 19:25                 ` Eric Biggers
  2016-11-07 19:41                   ` Jason A. Donenfeld
  0 siblings, 1 reply; 8+ messages in thread
From: Eric Biggers @ 2016-11-07 19:25 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Herbert Xu, Martin Willi, LKML, linux-crypto, David Miller,
	WireGuard mailing list

On Mon, Nov 07, 2016 at 08:02:35PM +0100, Jason A. Donenfeld wrote:
> On Mon, Nov 7, 2016 at 7:26 PM, Eric Biggers <ebiggers@google.com> wrote:
> >
> > I was not referring to any users in particular, only what users could do.  As an
> > example, if you did crypto_shash_update() with 32, 15, then 17 bytes, and the
> > underlying algorithm is poly1305-generic, the last block would end up
> > misaligned.  This doesn't appear possible with your pseudocode because it only
> > passes in multiples of the block size until the very end.  However I don't see
> > it claimed anywhere that shash API users have to do that.
> 
> Actually it appears that crypto/poly1305_generic.c already buffers
> incoming blocks to a buffer that definitely looks aligned, to prevent
> this condition!
> 

No it does *not* buffer all incoming blocks, which is why the source pointer can
fall out of alignment.  Yes, I actually tested this.  In fact this situation is
even hit, in both possible places, in the self-tests.

Eric


* Re: [WireGuard] [PATCH] poly1305: generic C can be faster on chips with slow unaligned access
  2016-11-07 19:25                 ` Eric Biggers
@ 2016-11-07 19:41                   ` Jason A. Donenfeld
  0 siblings, 0 replies; 8+ messages in thread
From: Jason A. Donenfeld @ 2016-11-07 19:41 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Herbert Xu, Martin Willi, LKML, linux-crypto, David Miller,
	WireGuard mailing list

On Mon, Nov 7, 2016 at 8:25 PM, Eric Biggers <ebiggers@google.com> wrote:
> No it does *not* buffer all incoming blocks, which is why the source pointer can
> fall out of alignment.  Yes, I actually tested this.  In fact this situation is
> even hit, in both possible places, in the self-tests.

Urgh! v3 coming right up...

