Development discussion of WireGuard
 help / color / mirror / Atom feed
* [PATCH crypto 0/2] smaller blake2s code size on m68k and other small platforms
       [not found] <CAHmME9qbnYmhvsuarButi6s=58=FPiti0Z-QnGMJ=OsMzy1eOg@mail.gmail.com>
@ 2022-01-11 13:49 ` Jason A. Donenfeld
  2022-01-11 13:49   ` [PATCH crypto 1/2] lib/crypto: blake2s-generic: reduce code size on small systems Jason A. Donenfeld
                     ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Jason A. Donenfeld @ 2022-01-11 13:49 UTC (permalink / raw)
  To: linux-crypto, netdev, wireguard, linux-kernel, bpf, geert, tytso,
	gregkh, jeanphilippe.aumasson, ardb
  Cc: Jason A. Donenfeld

Hi,

Geert emailed me this afternoon concerned about blake2s codesize on m68k
and other small systems. We identified two extremely effective ways of
chopping down the size. One of them moves some wireguard-specific things
into wireguard proper. The other one adds a slower codepath for
CONFIG_CC_OPTIMIZE_FOR_SIZE configurations. I really don't like that
slower codepath, but since it is configuration gated, at least it stays
out of the way except for people who know they need a tiny kernel image

Thanks,
Jason

Jason A. Donenfeld (2):
  lib/crypto: blake2s-generic: reduce code size on small systems
  lib/crypto: blake2s: move hmac construction into wireguard

 drivers/net/wireguard/noise.c | 45 ++++++++++++++++++++++++++++++-----
 include/crypto/blake2s.h      |  3 ---
 lib/crypto/blake2s-generic.c  | 30 +++++++++++++----------
 lib/crypto/blake2s-selftest.c | 31 ------------------------
 lib/crypto/blake2s.c          | 37 ----------------------------
 5 files changed, 57 insertions(+), 89 deletions(-)

-- 
2.34.1


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH crypto 1/2] lib/crypto: blake2s-generic: reduce code size on small systems
  2022-01-11 13:49 ` [PATCH crypto 0/2] smaller blake2s code size on m68k and other small platforms Jason A. Donenfeld
@ 2022-01-11 13:49   ` Jason A. Donenfeld
  2022-01-12 10:57     ` Geert Uytterhoeven
  2022-01-12 18:31     ` Eric Biggers
  2022-01-11 13:49   ` [PATCH crypto 2/2] lib/crypto: blake2s: move hmac construction into wireguard Jason A. Donenfeld
  2022-01-11 18:10   ` [PATCH crypto v2 0/2] reduce code size from blake2s on m68k and other small platforms Jason A. Donenfeld
  2 siblings, 2 replies; 23+ messages in thread
From: Jason A. Donenfeld @ 2022-01-11 13:49 UTC (permalink / raw)
  To: linux-crypto, netdev, wireguard, linux-kernel, bpf, geert, tytso,
	gregkh, jeanphilippe.aumasson, ardb
  Cc: Jason A. Donenfeld, Herbert Xu

Re-wind the loops entirely on kernels optimized for code size. This is
really not good at all performance-wise. But on m68k, it shaves off 4k
of code size, which is apparently important.

Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
---
 lib/crypto/blake2s-generic.c | 30 ++++++++++++++++++------------
 1 file changed, 18 insertions(+), 12 deletions(-)

diff --git a/lib/crypto/blake2s-generic.c b/lib/crypto/blake2s-generic.c
index 75ccb3e633e6..990f000e22ee 100644
--- a/lib/crypto/blake2s-generic.c
+++ b/lib/crypto/blake2s-generic.c
@@ -46,7 +46,7 @@ void blake2s_compress_generic(struct blake2s_state *state, const u8 *block,
 {
 	u32 m[16];
 	u32 v[16];
-	int i;
+	int i, j;
 
 	WARN_ON(IS_ENABLED(DEBUG) &&
 		(nblocks > 1 && inc != BLAKE2S_BLOCK_SIZE));
@@ -86,17 +86,23 @@ void blake2s_compress_generic(struct blake2s_state *state, const u8 *block,
 	G(r, 6, v[2], v[ 7], v[ 8], v[13]); \
 	G(r, 7, v[3], v[ 4], v[ 9], v[14]); \
 } while (0)
-		ROUND(0);
-		ROUND(1);
-		ROUND(2);
-		ROUND(3);
-		ROUND(4);
-		ROUND(5);
-		ROUND(6);
-		ROUND(7);
-		ROUND(8);
-		ROUND(9);
-
+		if (IS_ENABLED(CONFIG_CC_OPTIMIZE_FOR_SIZE)) {
+			for (i = 0; i < 10; ++i) {
+				for (j = 0; j < 8; ++j)
+					G(i, j, v[j % 4], v[((j + (j / 4)) % 4) + 4], v[((j + 2 * (j / 4)) % 4) + 8], v[((j + 3 * (j / 4)) % 4) + 12]);
+			}
+		} else {
+			ROUND(0);
+			ROUND(1);
+			ROUND(2);
+			ROUND(3);
+			ROUND(4);
+			ROUND(5);
+			ROUND(6);
+			ROUND(7);
+			ROUND(8);
+			ROUND(9);
+		}
 #undef G
 #undef ROUND
 
-- 
2.34.1


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH crypto 2/2] lib/crypto: blake2s: move hmac construction into wireguard
  2022-01-11 13:49 ` [PATCH crypto 0/2] smaller blake2s code size on m68k and other small platforms Jason A. Donenfeld
  2022-01-11 13:49   ` [PATCH crypto 1/2] lib/crypto: blake2s-generic: reduce code size on small systems Jason A. Donenfeld
@ 2022-01-11 13:49   ` Jason A. Donenfeld
  2022-01-11 14:43     ` Ard Biesheuvel
  2022-01-12 18:35     ` Eric Biggers
  2022-01-11 18:10   ` [PATCH crypto v2 0/2] reduce code size from blake2s on m68k and other small platforms Jason A. Donenfeld
  2 siblings, 2 replies; 23+ messages in thread
From: Jason A. Donenfeld @ 2022-01-11 13:49 UTC (permalink / raw)
  To: linux-crypto, netdev, wireguard, linux-kernel, bpf, geert, tytso,
	gregkh, jeanphilippe.aumasson, ardb
  Cc: Jason A. Donenfeld, Herbert Xu

Basically nobody should use blake2s in an HMAC construction; it already
has a keyed variant. But for unfortunately historical reasons, Noise,
used by WireGuard, uses HKDF quite strictly, which means we have to use
this. Because this really shouldn't be used by others, this commit moves
it into wireguard's noise.c locally, so that kernels that aren't using
WireGuard don't get this superfluous code baked in. On m68k systems,
this shaves off ~314 bytes.

Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: netdev@vger.kernel.org
Cc: wireguard@lists.zx2c4.com
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
---
 drivers/net/wireguard/noise.c | 45 ++++++++++++++++++++++++++++++-----
 include/crypto/blake2s.h      |  3 ---
 lib/crypto/blake2s-selftest.c | 31 ------------------------
 lib/crypto/blake2s.c          | 37 ----------------------------
 4 files changed, 39 insertions(+), 77 deletions(-)

diff --git a/drivers/net/wireguard/noise.c b/drivers/net/wireguard/noise.c
index c0cfd9b36c0b..720952b92e78 100644
--- a/drivers/net/wireguard/noise.c
+++ b/drivers/net/wireguard/noise.c
@@ -302,6 +302,41 @@ void wg_noise_set_static_identity_private_key(
 		static_identity->static_public, private_key);
 }
 
+static void hmac(u8 *out, const u8 *in, const u8 *key, const size_t inlen, const size_t keylen)
+{
+	struct blake2s_state state;
+	u8 x_key[BLAKE2S_BLOCK_SIZE] __aligned(__alignof__(u32)) = { 0 };
+	u8 i_hash[BLAKE2S_HASH_SIZE] __aligned(__alignof__(u32));
+	int i;
+
+	if (keylen > BLAKE2S_BLOCK_SIZE) {
+		blake2s_init(&state, BLAKE2S_HASH_SIZE);
+		blake2s_update(&state, key, keylen);
+		blake2s_final(&state, x_key);
+	} else
+		memcpy(x_key, key, keylen);
+
+	for (i = 0; i < BLAKE2S_BLOCK_SIZE; ++i)
+		x_key[i] ^= 0x36;
+
+	blake2s_init(&state, BLAKE2S_HASH_SIZE);
+	blake2s_update(&state, x_key, BLAKE2S_BLOCK_SIZE);
+	blake2s_update(&state, in, inlen);
+	blake2s_final(&state, i_hash);
+
+	for (i = 0; i < BLAKE2S_BLOCK_SIZE; ++i)
+		x_key[i] ^= 0x5c ^ 0x36;
+
+	blake2s_init(&state, BLAKE2S_HASH_SIZE);
+	blake2s_update(&state, x_key, BLAKE2S_BLOCK_SIZE);
+	blake2s_update(&state, i_hash, BLAKE2S_HASH_SIZE);
+	blake2s_final(&state, i_hash);
+
+	memcpy(out, i_hash, BLAKE2S_HASH_SIZE);
+	memzero_explicit(x_key, BLAKE2S_BLOCK_SIZE);
+	memzero_explicit(i_hash, BLAKE2S_HASH_SIZE);
+}
+
 /* This is Hugo Krawczyk's HKDF:
  *  - https://eprint.iacr.org/2010/264.pdf
  *  - https://tools.ietf.org/html/rfc5869
@@ -322,14 +357,14 @@ static void kdf(u8 *first_dst, u8 *second_dst, u8 *third_dst, const u8 *data,
 		 ((third_len || third_dst) && (!second_len || !second_dst))));
 
 	/* Extract entropy from data into secret */
-	blake2s256_hmac(secret, data, chaining_key, data_len, NOISE_HASH_LEN);
+	hmac(secret, data, chaining_key, data_len, NOISE_HASH_LEN);
 
 	if (!first_dst || !first_len)
 		goto out;
 
 	/* Expand first key: key = secret, data = 0x1 */
 	output[0] = 1;
-	blake2s256_hmac(output, output, secret, 1, BLAKE2S_HASH_SIZE);
+	hmac(output, output, secret, 1, BLAKE2S_HASH_SIZE);
 	memcpy(first_dst, output, first_len);
 
 	if (!second_dst || !second_len)
@@ -337,8 +372,7 @@ static void kdf(u8 *first_dst, u8 *second_dst, u8 *third_dst, const u8 *data,
 
 	/* Expand second key: key = secret, data = first-key || 0x2 */
 	output[BLAKE2S_HASH_SIZE] = 2;
-	blake2s256_hmac(output, output, secret, BLAKE2S_HASH_SIZE + 1,
-			BLAKE2S_HASH_SIZE);
+	hmac(output, output, secret, BLAKE2S_HASH_SIZE + 1, BLAKE2S_HASH_SIZE);
 	memcpy(second_dst, output, second_len);
 
 	if (!third_dst || !third_len)
@@ -346,8 +380,7 @@ static void kdf(u8 *first_dst, u8 *second_dst, u8 *third_dst, const u8 *data,
 
 	/* Expand third key: key = secret, data = second-key || 0x3 */
 	output[BLAKE2S_HASH_SIZE] = 3;
-	blake2s256_hmac(output, output, secret, BLAKE2S_HASH_SIZE + 1,
-			BLAKE2S_HASH_SIZE);
+	hmac(output, output, secret, BLAKE2S_HASH_SIZE + 1, BLAKE2S_HASH_SIZE);
 	memcpy(third_dst, output, third_len);
 
 out:
diff --git a/include/crypto/blake2s.h b/include/crypto/blake2s.h
index bc3fb59442ce..4e30e1799e61 100644
--- a/include/crypto/blake2s.h
+++ b/include/crypto/blake2s.h
@@ -101,7 +101,4 @@ static inline void blake2s(u8 *out, const u8 *in, const u8 *key,
 	blake2s_final(&state, out);
 }
 
-void blake2s256_hmac(u8 *out, const u8 *in, const u8 *key, const size_t inlen,
-		     const size_t keylen);
-
 #endif /* _CRYPTO_BLAKE2S_H */
diff --git a/lib/crypto/blake2s-selftest.c b/lib/crypto/blake2s-selftest.c
index 5d9ea53be973..409e4b728770 100644
--- a/lib/crypto/blake2s-selftest.c
+++ b/lib/crypto/blake2s-selftest.c
@@ -15,7 +15,6 @@
  * #include <stdio.h>
  *
  * #include <openssl/evp.h>
- * #include <openssl/hmac.h>
  *
  * #define BLAKE2S_TESTVEC_COUNT	256
  *
@@ -58,16 +57,6 @@
  *	}
  *	printf("};\n\n");
  *
- *	printf("static const u8 blake2s_hmac_testvecs[][BLAKE2S_HASH_SIZE] __initconst = {\n");
- *
- *	HMAC(EVP_blake2s256(), key, sizeof(key), buf, sizeof(buf), hash, NULL);
- *	print_vec(hash, BLAKE2S_OUTBYTES);
- *
- *	HMAC(EVP_blake2s256(), buf, sizeof(buf), key, sizeof(key), hash, NULL);
- *	print_vec(hash, BLAKE2S_OUTBYTES);
- *
- *	printf("};\n");
- *
  *	return 0;
  *}
  */
@@ -554,15 +543,6 @@ static const u8 blake2s_testvecs[][BLAKE2S_HASH_SIZE] __initconst = {
     0xd6, 0x98, 0x6b, 0x07, 0x10, 0x65, 0x52, 0x65, },
 };
 
-static const u8 blake2s_hmac_testvecs[][BLAKE2S_HASH_SIZE] __initconst = {
-  { 0xce, 0xe1, 0x57, 0x69, 0x82, 0xdc, 0xbf, 0x43, 0xad, 0x56, 0x4c, 0x70,
-    0xed, 0x68, 0x16, 0x96, 0xcf, 0xa4, 0x73, 0xe8, 0xe8, 0xfc, 0x32, 0x79,
-    0x08, 0x0a, 0x75, 0x82, 0xda, 0x3f, 0x05, 0x11, },
-  { 0x77, 0x2f, 0x0c, 0x71, 0x41, 0xf4, 0x4b, 0x2b, 0xb3, 0xc6, 0xb6, 0xf9,
-    0x60, 0xde, 0xe4, 0x52, 0x38, 0x66, 0xe8, 0xbf, 0x9b, 0x96, 0xc4, 0x9f,
-    0x60, 0xd9, 0x24, 0x37, 0x99, 0xd6, 0xec, 0x31, },
-};
-
 bool __init blake2s_selftest(void)
 {
 	u8 key[BLAKE2S_KEY_SIZE];
@@ -607,16 +587,5 @@ bool __init blake2s_selftest(void)
 		}
 	}
 
-	if (success) {
-		blake2s256_hmac(hash, buf, key, sizeof(buf), sizeof(key));
-		success &= !memcmp(hash, blake2s_hmac_testvecs[0], BLAKE2S_HASH_SIZE);
-
-		blake2s256_hmac(hash, key, buf, sizeof(key), sizeof(buf));
-		success &= !memcmp(hash, blake2s_hmac_testvecs[1], BLAKE2S_HASH_SIZE);
-
-		if (!success)
-			pr_err("blake2s256_hmac self-test: FAIL\n");
-	}
-
 	return success;
 }
diff --git a/lib/crypto/blake2s.c b/lib/crypto/blake2s.c
index 93f2ae051370..9364f79937b8 100644
--- a/lib/crypto/blake2s.c
+++ b/lib/crypto/blake2s.c
@@ -30,43 +30,6 @@ void blake2s_final(struct blake2s_state *state, u8 *out)
 }
 EXPORT_SYMBOL(blake2s_final);
 
-void blake2s256_hmac(u8 *out, const u8 *in, const u8 *key, const size_t inlen,
-		     const size_t keylen)
-{
-	struct blake2s_state state;
-	u8 x_key[BLAKE2S_BLOCK_SIZE] __aligned(__alignof__(u32)) = { 0 };
-	u8 i_hash[BLAKE2S_HASH_SIZE] __aligned(__alignof__(u32));
-	int i;
-
-	if (keylen > BLAKE2S_BLOCK_SIZE) {
-		blake2s_init(&state, BLAKE2S_HASH_SIZE);
-		blake2s_update(&state, key, keylen);
-		blake2s_final(&state, x_key);
-	} else
-		memcpy(x_key, key, keylen);
-
-	for (i = 0; i < BLAKE2S_BLOCK_SIZE; ++i)
-		x_key[i] ^= 0x36;
-
-	blake2s_init(&state, BLAKE2S_HASH_SIZE);
-	blake2s_update(&state, x_key, BLAKE2S_BLOCK_SIZE);
-	blake2s_update(&state, in, inlen);
-	blake2s_final(&state, i_hash);
-
-	for (i = 0; i < BLAKE2S_BLOCK_SIZE; ++i)
-		x_key[i] ^= 0x5c ^ 0x36;
-
-	blake2s_init(&state, BLAKE2S_HASH_SIZE);
-	blake2s_update(&state, x_key, BLAKE2S_BLOCK_SIZE);
-	blake2s_update(&state, i_hash, BLAKE2S_HASH_SIZE);
-	blake2s_final(&state, i_hash);
-
-	memcpy(out, i_hash, BLAKE2S_HASH_SIZE);
-	memzero_explicit(x_key, BLAKE2S_BLOCK_SIZE);
-	memzero_explicit(i_hash, BLAKE2S_HASH_SIZE);
-}
-EXPORT_SYMBOL(blake2s256_hmac);
-
 static int __init blake2s_mod_init(void)
 {
 	if (!IS_ENABLED(CONFIG_CRYPTO_MANAGER_DISABLE_TESTS) &&
-- 
2.34.1


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH crypto 2/2] lib/crypto: blake2s: move hmac construction into wireguard
  2022-01-11 13:49   ` [PATCH crypto 2/2] lib/crypto: blake2s: move hmac construction into wireguard Jason A. Donenfeld
@ 2022-01-11 14:43     ` Ard Biesheuvel
  2022-01-12 18:35     ` Eric Biggers
  1 sibling, 0 replies; 23+ messages in thread
From: Ard Biesheuvel @ 2022-01-11 14:43 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Linux Crypto Mailing List,
	open list:BPF JIT for MIPS (32-BIT AND 64-BIT),
	wireguard, Linux Kernel Mailing List,
	open list:BPF JIT for MIPS (32-BIT AND 64-BIT),
	Geert Uytterhoeven, Theodore Y. Ts'o, Greg Kroah-Hartman,
	Jean-Philippe Aumasson, Herbert Xu

On Tue, 11 Jan 2022 at 14:49, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
> Basically nobody should use blake2s in an HMAC construction; it already
> has a keyed variant. But for unfortunately historical reasons, Noise,

-ly

> used by WireGuard, uses HKDF quite strictly, which means we have to use
> this. Because this really shouldn't be used by others, this commit moves
> it into wireguard's noise.c locally, so that kernels that aren't using
> WireGuard don't get this superfluous code baked in. On m68k systems,
> this shaves off ~314 bytes.
>
> Cc: Geert Uytterhoeven <geert@linux-m68k.org>
> Cc: Herbert Xu <herbert@gondor.apana.org.au>
> Cc: Ard Biesheuvel <ardb@kernel.org>
> Cc: netdev@vger.kernel.org
> Cc: wireguard@lists.zx2c4.com
> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>

Acked-by: Ard Biesheuvel <ardb@kernel.org>

> ---
>  drivers/net/wireguard/noise.c | 45 ++++++++++++++++++++++++++++++-----
>  include/crypto/blake2s.h      |  3 ---
>  lib/crypto/blake2s-selftest.c | 31 ------------------------
>  lib/crypto/blake2s.c          | 37 ----------------------------
>  4 files changed, 39 insertions(+), 77 deletions(-)
>
> diff --git a/drivers/net/wireguard/noise.c b/drivers/net/wireguard/noise.c
> index c0cfd9b36c0b..720952b92e78 100644
> --- a/drivers/net/wireguard/noise.c
> +++ b/drivers/net/wireguard/noise.c
> @@ -302,6 +302,41 @@ void wg_noise_set_static_identity_private_key(
>                 static_identity->static_public, private_key);
>  }
>
> +static void hmac(u8 *out, const u8 *in, const u8 *key, const size_t inlen, const size_t keylen)
> +{
> +       struct blake2s_state state;
> +       u8 x_key[BLAKE2S_BLOCK_SIZE] __aligned(__alignof__(u32)) = { 0 };
> +       u8 i_hash[BLAKE2S_HASH_SIZE] __aligned(__alignof__(u32));
> +       int i;
> +
> +       if (keylen > BLAKE2S_BLOCK_SIZE) {
> +               blake2s_init(&state, BLAKE2S_HASH_SIZE);
> +               blake2s_update(&state, key, keylen);
> +               blake2s_final(&state, x_key);
> +       } else
> +               memcpy(x_key, key, keylen);
> +
> +       for (i = 0; i < BLAKE2S_BLOCK_SIZE; ++i)
> +               x_key[i] ^= 0x36;
> +
> +       blake2s_init(&state, BLAKE2S_HASH_SIZE);
> +       blake2s_update(&state, x_key, BLAKE2S_BLOCK_SIZE);
> +       blake2s_update(&state, in, inlen);
> +       blake2s_final(&state, i_hash);
> +
> +       for (i = 0; i < BLAKE2S_BLOCK_SIZE; ++i)
> +               x_key[i] ^= 0x5c ^ 0x36;
> +
> +       blake2s_init(&state, BLAKE2S_HASH_SIZE);
> +       blake2s_update(&state, x_key, BLAKE2S_BLOCK_SIZE);
> +       blake2s_update(&state, i_hash, BLAKE2S_HASH_SIZE);
> +       blake2s_final(&state, i_hash);
> +
> +       memcpy(out, i_hash, BLAKE2S_HASH_SIZE);
> +       memzero_explicit(x_key, BLAKE2S_BLOCK_SIZE);
> +       memzero_explicit(i_hash, BLAKE2S_HASH_SIZE);
> +}
> +
>  /* This is Hugo Krawczyk's HKDF:
>   *  - https://eprint.iacr.org/2010/264.pdf
>   *  - https://tools.ietf.org/html/rfc5869
> @@ -322,14 +357,14 @@ static void kdf(u8 *first_dst, u8 *second_dst, u8 *third_dst, const u8 *data,
>                  ((third_len || third_dst) && (!second_len || !second_dst))));
>
>         /* Extract entropy from data into secret */
> -       blake2s256_hmac(secret, data, chaining_key, data_len, NOISE_HASH_LEN);
> +       hmac(secret, data, chaining_key, data_len, NOISE_HASH_LEN);
>
>         if (!first_dst || !first_len)
>                 goto out;
>
>         /* Expand first key: key = secret, data = 0x1 */
>         output[0] = 1;
> -       blake2s256_hmac(output, output, secret, 1, BLAKE2S_HASH_SIZE);
> +       hmac(output, output, secret, 1, BLAKE2S_HASH_SIZE);
>         memcpy(first_dst, output, first_len);
>
>         if (!second_dst || !second_len)
> @@ -337,8 +372,7 @@ static void kdf(u8 *first_dst, u8 *second_dst, u8 *third_dst, const u8 *data,
>
>         /* Expand second key: key = secret, data = first-key || 0x2 */
>         output[BLAKE2S_HASH_SIZE] = 2;
> -       blake2s256_hmac(output, output, secret, BLAKE2S_HASH_SIZE + 1,
> -                       BLAKE2S_HASH_SIZE);
> +       hmac(output, output, secret, BLAKE2S_HASH_SIZE + 1, BLAKE2S_HASH_SIZE);
>         memcpy(second_dst, output, second_len);
>
>         if (!third_dst || !third_len)
> @@ -346,8 +380,7 @@ static void kdf(u8 *first_dst, u8 *second_dst, u8 *third_dst, const u8 *data,
>
>         /* Expand third key: key = secret, data = second-key || 0x3 */
>         output[BLAKE2S_HASH_SIZE] = 3;
> -       blake2s256_hmac(output, output, secret, BLAKE2S_HASH_SIZE + 1,
> -                       BLAKE2S_HASH_SIZE);
> +       hmac(output, output, secret, BLAKE2S_HASH_SIZE + 1, BLAKE2S_HASH_SIZE);
>         memcpy(third_dst, output, third_len);
>
>  out:
> diff --git a/include/crypto/blake2s.h b/include/crypto/blake2s.h
> index bc3fb59442ce..4e30e1799e61 100644
> --- a/include/crypto/blake2s.h
> +++ b/include/crypto/blake2s.h
> @@ -101,7 +101,4 @@ static inline void blake2s(u8 *out, const u8 *in, const u8 *key,
>         blake2s_final(&state, out);
>  }
>
> -void blake2s256_hmac(u8 *out, const u8 *in, const u8 *key, const size_t inlen,
> -                    const size_t keylen);
> -
>  #endif /* _CRYPTO_BLAKE2S_H */
> diff --git a/lib/crypto/blake2s-selftest.c b/lib/crypto/blake2s-selftest.c
> index 5d9ea53be973..409e4b728770 100644
> --- a/lib/crypto/blake2s-selftest.c
> +++ b/lib/crypto/blake2s-selftest.c
> @@ -15,7 +15,6 @@
>   * #include <stdio.h>
>   *
>   * #include <openssl/evp.h>
> - * #include <openssl/hmac.h>
>   *
>   * #define BLAKE2S_TESTVEC_COUNT       256
>   *
> @@ -58,16 +57,6 @@
>   *     }
>   *     printf("};\n\n");
>   *
> - *     printf("static const u8 blake2s_hmac_testvecs[][BLAKE2S_HASH_SIZE] __initconst = {\n");
> - *
> - *     HMAC(EVP_blake2s256(), key, sizeof(key), buf, sizeof(buf), hash, NULL);
> - *     print_vec(hash, BLAKE2S_OUTBYTES);
> - *
> - *     HMAC(EVP_blake2s256(), buf, sizeof(buf), key, sizeof(key), hash, NULL);
> - *     print_vec(hash, BLAKE2S_OUTBYTES);
> - *
> - *     printf("};\n");
> - *
>   *     return 0;
>   *}
>   */
> @@ -554,15 +543,6 @@ static const u8 blake2s_testvecs[][BLAKE2S_HASH_SIZE] __initconst = {
>      0xd6, 0x98, 0x6b, 0x07, 0x10, 0x65, 0x52, 0x65, },
>  };
>
> -static const u8 blake2s_hmac_testvecs[][BLAKE2S_HASH_SIZE] __initconst = {
> -  { 0xce, 0xe1, 0x57, 0x69, 0x82, 0xdc, 0xbf, 0x43, 0xad, 0x56, 0x4c, 0x70,
> -    0xed, 0x68, 0x16, 0x96, 0xcf, 0xa4, 0x73, 0xe8, 0xe8, 0xfc, 0x32, 0x79,
> -    0x08, 0x0a, 0x75, 0x82, 0xda, 0x3f, 0x05, 0x11, },
> -  { 0x77, 0x2f, 0x0c, 0x71, 0x41, 0xf4, 0x4b, 0x2b, 0xb3, 0xc6, 0xb6, 0xf9,
> -    0x60, 0xde, 0xe4, 0x52, 0x38, 0x66, 0xe8, 0xbf, 0x9b, 0x96, 0xc4, 0x9f,
> -    0x60, 0xd9, 0x24, 0x37, 0x99, 0xd6, 0xec, 0x31, },
> -};
> -
>  bool __init blake2s_selftest(void)
>  {
>         u8 key[BLAKE2S_KEY_SIZE];
> @@ -607,16 +587,5 @@ bool __init blake2s_selftest(void)
>                 }
>         }
>
> -       if (success) {
> -               blake2s256_hmac(hash, buf, key, sizeof(buf), sizeof(key));
> -               success &= !memcmp(hash, blake2s_hmac_testvecs[0], BLAKE2S_HASH_SIZE);
> -
> -               blake2s256_hmac(hash, key, buf, sizeof(key), sizeof(buf));
> -               success &= !memcmp(hash, blake2s_hmac_testvecs[1], BLAKE2S_HASH_SIZE);
> -
> -               if (!success)
> -                       pr_err("blake2s256_hmac self-test: FAIL\n");
> -       }
> -
>         return success;
>  }
> diff --git a/lib/crypto/blake2s.c b/lib/crypto/blake2s.c
> index 93f2ae051370..9364f79937b8 100644
> --- a/lib/crypto/blake2s.c
> +++ b/lib/crypto/blake2s.c
> @@ -30,43 +30,6 @@ void blake2s_final(struct blake2s_state *state, u8 *out)
>  }
>  EXPORT_SYMBOL(blake2s_final);
>
> -void blake2s256_hmac(u8 *out, const u8 *in, const u8 *key, const size_t inlen,
> -                    const size_t keylen)
> -{
> -       struct blake2s_state state;
> -       u8 x_key[BLAKE2S_BLOCK_SIZE] __aligned(__alignof__(u32)) = { 0 };
> -       u8 i_hash[BLAKE2S_HASH_SIZE] __aligned(__alignof__(u32));
> -       int i;
> -
> -       if (keylen > BLAKE2S_BLOCK_SIZE) {
> -               blake2s_init(&state, BLAKE2S_HASH_SIZE);
> -               blake2s_update(&state, key, keylen);
> -               blake2s_final(&state, x_key);
> -       } else
> -               memcpy(x_key, key, keylen);
> -
> -       for (i = 0; i < BLAKE2S_BLOCK_SIZE; ++i)
> -               x_key[i] ^= 0x36;
> -
> -       blake2s_init(&state, BLAKE2S_HASH_SIZE);
> -       blake2s_update(&state, x_key, BLAKE2S_BLOCK_SIZE);
> -       blake2s_update(&state, in, inlen);
> -       blake2s_final(&state, i_hash);
> -
> -       for (i = 0; i < BLAKE2S_BLOCK_SIZE; ++i)
> -               x_key[i] ^= 0x5c ^ 0x36;
> -
> -       blake2s_init(&state, BLAKE2S_HASH_SIZE);
> -       blake2s_update(&state, x_key, BLAKE2S_BLOCK_SIZE);
> -       blake2s_update(&state, i_hash, BLAKE2S_HASH_SIZE);
> -       blake2s_final(&state, i_hash);
> -
> -       memcpy(out, i_hash, BLAKE2S_HASH_SIZE);
> -       memzero_explicit(x_key, BLAKE2S_BLOCK_SIZE);
> -       memzero_explicit(i_hash, BLAKE2S_HASH_SIZE);
> -}
> -EXPORT_SYMBOL(blake2s256_hmac);
> -
>  static int __init blake2s_mod_init(void)
>  {
>         if (!IS_ENABLED(CONFIG_CRYPTO_MANAGER_DISABLE_TESTS) &&
> --
> 2.34.1
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH crypto v2 0/2] reduce code size from blake2s on m68k and other small platforms
  2022-01-11 13:49 ` [PATCH crypto 0/2] smaller blake2s code size on m68k and other small platforms Jason A. Donenfeld
  2022-01-11 13:49   ` [PATCH crypto 1/2] lib/crypto: blake2s-generic: reduce code size on small systems Jason A. Donenfeld
  2022-01-11 13:49   ` [PATCH crypto 2/2] lib/crypto: blake2s: move hmac construction into wireguard Jason A. Donenfeld
@ 2022-01-11 18:10   ` Jason A. Donenfeld
  2022-01-11 18:10     ` [PATCH crypto v2 1/2] lib/crypto: blake2s: move hmac construction into wireguard Jason A. Donenfeld
                       ` (2 more replies)
  2 siblings, 3 replies; 23+ messages in thread
From: Jason A. Donenfeld @ 2022-01-11 18:10 UTC (permalink / raw)
  To: linux-crypto, netdev, wireguard, linux-kernel, bpf, geert, tytso,
	gregkh, jeanphilippe.aumasson, ardb
  Cc: Jason A. Donenfeld

Hi,

Geert emailed me this afternoon concerned about blake2s codesize on m68k
and other small systems. We identified two effective ways of chopping
down the size. One of them moves some wireguard-specific things into
wireguard proper. The other one adds a slower codepath for small
machines to blake2s. This worked, and was v1 of this patchset, but I
wasn't so much of a fan. Then someone pointed out that the generic C
SHA-1 implementation is still unrolled, which is a *lot* of extra code.
Simply rerolling that saves about as much as v1 did. So, we instead do
that in this v2 patchset. SHA-1 is being phased out, and soon it won't
be included at all (hopefully). And nothing performance-oriented has
anything to do with it anyway.

The result of these two patches mitigates Geert's feared code size
increase for 5.17.

Thanks,
Jason


Jason A. Donenfeld (2):
  lib/crypto: blake2s: move hmac construction into wireguard
  lib/crypto: sha1: re-roll loops to reduce code size

 drivers/net/wireguard/noise.c |  45 +++++++++++--
 include/crypto/blake2s.h      |   3 -
 lib/crypto/blake2s-selftest.c |  31 ---------
 lib/crypto/blake2s.c          |  37 -----------
 lib/sha1.c                    | 117 ++++++++--------------------------
 5 files changed, 64 insertions(+), 169 deletions(-)

-- 
2.34.1


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH crypto v2 1/2] lib/crypto: blake2s: move hmac construction into wireguard
  2022-01-11 18:10   ` [PATCH crypto v2 0/2] reduce code size from blake2s on m68k and other small platforms Jason A. Donenfeld
@ 2022-01-11 18:10     ` Jason A. Donenfeld
  2022-01-11 18:10     ` [PATCH crypto v2 2/2] lib/crypto: sha1: re-roll loops to reduce code size Jason A. Donenfeld
  2022-01-11 22:05     ` [PATCH crypto v3 0/2] reduce code size from blake2s on m68k and other small platforms Jason A. Donenfeld
  2 siblings, 0 replies; 23+ messages in thread
From: Jason A. Donenfeld @ 2022-01-11 18:10 UTC (permalink / raw)
  To: linux-crypto, netdev, wireguard, linux-kernel, bpf, geert, tytso,
	gregkh, jeanphilippe.aumasson, ardb
  Cc: Jason A. Donenfeld, Herbert Xu

Basically nobody should use blake2s in an HMAC construction; it already
has a keyed variant. But unfortunately for historical reasons, Noise,
used by WireGuard, uses HKDF quite strictly, which means we have to use
this. Because this really shouldn't be used by others, this commit moves
it into wireguard's noise.c locally, so that kernels that aren't using
WireGuard don't get this superfluous code baked in. On m68k systems,
this shaves off ~314 bytes.

Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
---
 drivers/net/wireguard/noise.c | 45 ++++++++++++++++++++++++++++++-----
 include/crypto/blake2s.h      |  3 ---
 lib/crypto/blake2s-selftest.c | 31 ------------------------
 lib/crypto/blake2s.c          | 37 ----------------------------
 4 files changed, 39 insertions(+), 77 deletions(-)

diff --git a/drivers/net/wireguard/noise.c b/drivers/net/wireguard/noise.c
index c0cfd9b36c0b..720952b92e78 100644
--- a/drivers/net/wireguard/noise.c
+++ b/drivers/net/wireguard/noise.c
@@ -302,6 +302,41 @@ void wg_noise_set_static_identity_private_key(
 		static_identity->static_public, private_key);
 }
 
+static void hmac(u8 *out, const u8 *in, const u8 *key, const size_t inlen, const size_t keylen)
+{
+	struct blake2s_state state;
+	u8 x_key[BLAKE2S_BLOCK_SIZE] __aligned(__alignof__(u32)) = { 0 };
+	u8 i_hash[BLAKE2S_HASH_SIZE] __aligned(__alignof__(u32));
+	int i;
+
+	if (keylen > BLAKE2S_BLOCK_SIZE) {
+		blake2s_init(&state, BLAKE2S_HASH_SIZE);
+		blake2s_update(&state, key, keylen);
+		blake2s_final(&state, x_key);
+	} else
+		memcpy(x_key, key, keylen);
+
+	for (i = 0; i < BLAKE2S_BLOCK_SIZE; ++i)
+		x_key[i] ^= 0x36;
+
+	blake2s_init(&state, BLAKE2S_HASH_SIZE);
+	blake2s_update(&state, x_key, BLAKE2S_BLOCK_SIZE);
+	blake2s_update(&state, in, inlen);
+	blake2s_final(&state, i_hash);
+
+	for (i = 0; i < BLAKE2S_BLOCK_SIZE; ++i)
+		x_key[i] ^= 0x5c ^ 0x36;
+
+	blake2s_init(&state, BLAKE2S_HASH_SIZE);
+	blake2s_update(&state, x_key, BLAKE2S_BLOCK_SIZE);
+	blake2s_update(&state, i_hash, BLAKE2S_HASH_SIZE);
+	blake2s_final(&state, i_hash);
+
+	memcpy(out, i_hash, BLAKE2S_HASH_SIZE);
+	memzero_explicit(x_key, BLAKE2S_BLOCK_SIZE);
+	memzero_explicit(i_hash, BLAKE2S_HASH_SIZE);
+}
+
 /* This is Hugo Krawczyk's HKDF:
  *  - https://eprint.iacr.org/2010/264.pdf
  *  - https://tools.ietf.org/html/rfc5869
@@ -322,14 +357,14 @@ static void kdf(u8 *first_dst, u8 *second_dst, u8 *third_dst, const u8 *data,
 		 ((third_len || third_dst) && (!second_len || !second_dst))));
 
 	/* Extract entropy from data into secret */
-	blake2s256_hmac(secret, data, chaining_key, data_len, NOISE_HASH_LEN);
+	hmac(secret, data, chaining_key, data_len, NOISE_HASH_LEN);
 
 	if (!first_dst || !first_len)
 		goto out;
 
 	/* Expand first key: key = secret, data = 0x1 */
 	output[0] = 1;
-	blake2s256_hmac(output, output, secret, 1, BLAKE2S_HASH_SIZE);
+	hmac(output, output, secret, 1, BLAKE2S_HASH_SIZE);
 	memcpy(first_dst, output, first_len);
 
 	if (!second_dst || !second_len)
@@ -337,8 +372,7 @@ static void kdf(u8 *first_dst, u8 *second_dst, u8 *third_dst, const u8 *data,
 
 	/* Expand second key: key = secret, data = first-key || 0x2 */
 	output[BLAKE2S_HASH_SIZE] = 2;
-	blake2s256_hmac(output, output, secret, BLAKE2S_HASH_SIZE + 1,
-			BLAKE2S_HASH_SIZE);
+	hmac(output, output, secret, BLAKE2S_HASH_SIZE + 1, BLAKE2S_HASH_SIZE);
 	memcpy(second_dst, output, second_len);
 
 	if (!third_dst || !third_len)
@@ -346,8 +380,7 @@ static void kdf(u8 *first_dst, u8 *second_dst, u8 *third_dst, const u8 *data,
 
 	/* Expand third key: key = secret, data = second-key || 0x3 */
 	output[BLAKE2S_HASH_SIZE] = 3;
-	blake2s256_hmac(output, output, secret, BLAKE2S_HASH_SIZE + 1,
-			BLAKE2S_HASH_SIZE);
+	hmac(output, output, secret, BLAKE2S_HASH_SIZE + 1, BLAKE2S_HASH_SIZE);
 	memcpy(third_dst, output, third_len);
 
 out:
diff --git a/include/crypto/blake2s.h b/include/crypto/blake2s.h
index bc3fb59442ce..4e30e1799e61 100644
--- a/include/crypto/blake2s.h
+++ b/include/crypto/blake2s.h
@@ -101,7 +101,4 @@ static inline void blake2s(u8 *out, const u8 *in, const u8 *key,
 	blake2s_final(&state, out);
 }
 
-void blake2s256_hmac(u8 *out, const u8 *in, const u8 *key, const size_t inlen,
-		     const size_t keylen);
-
 #endif /* _CRYPTO_BLAKE2S_H */
diff --git a/lib/crypto/blake2s-selftest.c b/lib/crypto/blake2s-selftest.c
index 5d9ea53be973..409e4b728770 100644
--- a/lib/crypto/blake2s-selftest.c
+++ b/lib/crypto/blake2s-selftest.c
@@ -15,7 +15,6 @@
  * #include <stdio.h>
  *
  * #include <openssl/evp.h>
- * #include <openssl/hmac.h>
  *
  * #define BLAKE2S_TESTVEC_COUNT	256
  *
@@ -58,16 +57,6 @@
  *	}
  *	printf("};\n\n");
  *
- *	printf("static const u8 blake2s_hmac_testvecs[][BLAKE2S_HASH_SIZE] __initconst = {\n");
- *
- *	HMAC(EVP_blake2s256(), key, sizeof(key), buf, sizeof(buf), hash, NULL);
- *	print_vec(hash, BLAKE2S_OUTBYTES);
- *
- *	HMAC(EVP_blake2s256(), buf, sizeof(buf), key, sizeof(key), hash, NULL);
- *	print_vec(hash, BLAKE2S_OUTBYTES);
- *
- *	printf("};\n");
- *
  *	return 0;
  *}
  */
@@ -554,15 +543,6 @@ static const u8 blake2s_testvecs[][BLAKE2S_HASH_SIZE] __initconst = {
     0xd6, 0x98, 0x6b, 0x07, 0x10, 0x65, 0x52, 0x65, },
 };
 
-static const u8 blake2s_hmac_testvecs[][BLAKE2S_HASH_SIZE] __initconst = {
-  { 0xce, 0xe1, 0x57, 0x69, 0x82, 0xdc, 0xbf, 0x43, 0xad, 0x56, 0x4c, 0x70,
-    0xed, 0x68, 0x16, 0x96, 0xcf, 0xa4, 0x73, 0xe8, 0xe8, 0xfc, 0x32, 0x79,
-    0x08, 0x0a, 0x75, 0x82, 0xda, 0x3f, 0x05, 0x11, },
-  { 0x77, 0x2f, 0x0c, 0x71, 0x41, 0xf4, 0x4b, 0x2b, 0xb3, 0xc6, 0xb6, 0xf9,
-    0x60, 0xde, 0xe4, 0x52, 0x38, 0x66, 0xe8, 0xbf, 0x9b, 0x96, 0xc4, 0x9f,
-    0x60, 0xd9, 0x24, 0x37, 0x99, 0xd6, 0xec, 0x31, },
-};
-
 bool __init blake2s_selftest(void)
 {
 	u8 key[BLAKE2S_KEY_SIZE];
@@ -607,16 +587,5 @@ bool __init blake2s_selftest(void)
 		}
 	}
 
-	if (success) {
-		blake2s256_hmac(hash, buf, key, sizeof(buf), sizeof(key));
-		success &= !memcmp(hash, blake2s_hmac_testvecs[0], BLAKE2S_HASH_SIZE);
-
-		blake2s256_hmac(hash, key, buf, sizeof(key), sizeof(buf));
-		success &= !memcmp(hash, blake2s_hmac_testvecs[1], BLAKE2S_HASH_SIZE);
-
-		if (!success)
-			pr_err("blake2s256_hmac self-test: FAIL\n");
-	}
-
 	return success;
 }
diff --git a/lib/crypto/blake2s.c b/lib/crypto/blake2s.c
index 93f2ae051370..9364f79937b8 100644
--- a/lib/crypto/blake2s.c
+++ b/lib/crypto/blake2s.c
@@ -30,43 +30,6 @@ void blake2s_final(struct blake2s_state *state, u8 *out)
 }
 EXPORT_SYMBOL(blake2s_final);
 
-void blake2s256_hmac(u8 *out, const u8 *in, const u8 *key, const size_t inlen,
-		     const size_t keylen)
-{
-	struct blake2s_state state;
-	u8 x_key[BLAKE2S_BLOCK_SIZE] __aligned(__alignof__(u32)) = { 0 };
-	u8 i_hash[BLAKE2S_HASH_SIZE] __aligned(__alignof__(u32));
-	int i;
-
-	if (keylen > BLAKE2S_BLOCK_SIZE) {
-		blake2s_init(&state, BLAKE2S_HASH_SIZE);
-		blake2s_update(&state, key, keylen);
-		blake2s_final(&state, x_key);
-	} else
-		memcpy(x_key, key, keylen);
-
-	for (i = 0; i < BLAKE2S_BLOCK_SIZE; ++i)
-		x_key[i] ^= 0x36;
-
-	blake2s_init(&state, BLAKE2S_HASH_SIZE);
-	blake2s_update(&state, x_key, BLAKE2S_BLOCK_SIZE);
-	blake2s_update(&state, in, inlen);
-	blake2s_final(&state, i_hash);
-
-	for (i = 0; i < BLAKE2S_BLOCK_SIZE; ++i)
-		x_key[i] ^= 0x5c ^ 0x36;
-
-	blake2s_init(&state, BLAKE2S_HASH_SIZE);
-	blake2s_update(&state, x_key, BLAKE2S_BLOCK_SIZE);
-	blake2s_update(&state, i_hash, BLAKE2S_HASH_SIZE);
-	blake2s_final(&state, i_hash);
-
-	memcpy(out, i_hash, BLAKE2S_HASH_SIZE);
-	memzero_explicit(x_key, BLAKE2S_BLOCK_SIZE);
-	memzero_explicit(i_hash, BLAKE2S_HASH_SIZE);
-}
-EXPORT_SYMBOL(blake2s256_hmac);
-
 static int __init blake2s_mod_init(void)
 {
 	if (!IS_ENABLED(CONFIG_CRYPTO_MANAGER_DISABLE_TESTS) &&
-- 
2.34.1


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH crypto v2 2/2] lib/crypto: sha1: re-roll loops to reduce code size
  2022-01-11 18:10   ` [PATCH crypto v2 0/2] reduce code size from blake2s on m68k and other small platforms Jason A. Donenfeld
  2022-01-11 18:10     ` [PATCH crypto v2 1/2] lib/crypto: blake2s: move hmac construction into wireguard Jason A. Donenfeld
@ 2022-01-11 18:10     ` Jason A. Donenfeld
  2022-01-11 22:05     ` [PATCH crypto v3 0/2] reduce code size from blake2s on m68k and other small platforms Jason A. Donenfeld
  2 siblings, 0 replies; 23+ messages in thread
From: Jason A. Donenfeld @ 2022-01-11 18:10 UTC (permalink / raw)
  To: linux-crypto, netdev, wireguard, linux-kernel, bpf, geert, tytso,
	gregkh, jeanphilippe.aumasson, ardb
  Cc: Jason A. Donenfeld, Herbert Xu

With SHA-1 no longer being used for anything performance oriented, and
also soon to be phased out entirely, we can make up for the space added
by unrolled BLAKE2s by simply re-rolling SHA-1. Since SHA-1 is so much
more complex, re-rolling it more or less takes care of the code size
added by BLAKE2s. And eventually, hopefully we'll see SHA-1 removed
entirely from most small kernel builds.

Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
---
 lib/sha1.c | 117 ++++++++++++-----------------------------------------
 1 file changed, 25 insertions(+), 92 deletions(-)

diff --git a/lib/sha1.c b/lib/sha1.c
index 9bd1935a1472..f2acfa294e64 100644
--- a/lib/sha1.c
+++ b/lib/sha1.c
@@ -9,6 +9,7 @@
 #include <linux/kernel.h>
 #include <linux/export.h>
 #include <linux/bitops.h>
+#include <linux/string.h>
 #include <crypto/sha1.h>
 #include <asm/unaligned.h>
 
@@ -83,109 +84,41 @@
  */
 void sha1_transform(__u32 *digest, const char *data, __u32 *array)
 {
-	__u32 A, B, C, D, E;
+	u32 d[5];
+	unsigned int i = 0;
 
-	A = digest[0];
-	B = digest[1];
-	C = digest[2];
-	D = digest[3];
-	E = digest[4];
+	memcpy(d, digest, sizeof(d));
 
 	/* Round 1 - iterations 0-16 take their input from 'data' */
-	T_0_15( 0, A, B, C, D, E);
-	T_0_15( 1, E, A, B, C, D);
-	T_0_15( 2, D, E, A, B, C);
-	T_0_15( 3, C, D, E, A, B);
-	T_0_15( 4, B, C, D, E, A);
-	T_0_15( 5, A, B, C, D, E);
-	T_0_15( 6, E, A, B, C, D);
-	T_0_15( 7, D, E, A, B, C);
-	T_0_15( 8, C, D, E, A, B);
-	T_0_15( 9, B, C, D, E, A);
-	T_0_15(10, A, B, C, D, E);
-	T_0_15(11, E, A, B, C, D);
-	T_0_15(12, D, E, A, B, C);
-	T_0_15(13, C, D, E, A, B);
-	T_0_15(14, B, C, D, E, A);
-	T_0_15(15, A, B, C, D, E);
+	for (; i < 16; ++i)
+		T_0_15(i, d[(-6 - i) % 5], d[(-5 - i) % 5],
+		       d[(-4 - i) % 5], d[(-3 - i) % 5], d[(-2 - i) % 5]);
 
 	/* Round 1 - tail. Input from 512-bit mixing array */
-	T_16_19(16, E, A, B, C, D);
-	T_16_19(17, D, E, A, B, C);
-	T_16_19(18, C, D, E, A, B);
-	T_16_19(19, B, C, D, E, A);
+	for (; i < 20; ++i)
+		T_16_19(i, d[(-6 - i) % 5], d[(-5 - i) % 5],
+			d[(-4 - i) % 5], d[(-3 - i) % 5], d[(-2 - i) % 5]);
 
 	/* Round 2 */
-	T_20_39(20, A, B, C, D, E);
-	T_20_39(21, E, A, B, C, D);
-	T_20_39(22, D, E, A, B, C);
-	T_20_39(23, C, D, E, A, B);
-	T_20_39(24, B, C, D, E, A);
-	T_20_39(25, A, B, C, D, E);
-	T_20_39(26, E, A, B, C, D);
-	T_20_39(27, D, E, A, B, C);
-	T_20_39(28, C, D, E, A, B);
-	T_20_39(29, B, C, D, E, A);
-	T_20_39(30, A, B, C, D, E);
-	T_20_39(31, E, A, B, C, D);
-	T_20_39(32, D, E, A, B, C);
-	T_20_39(33, C, D, E, A, B);
-	T_20_39(34, B, C, D, E, A);
-	T_20_39(35, A, B, C, D, E);
-	T_20_39(36, E, A, B, C, D);
-	T_20_39(37, D, E, A, B, C);
-	T_20_39(38, C, D, E, A, B);
-	T_20_39(39, B, C, D, E, A);
+	for (; i < 40; ++i)
+		T_20_39(i, d[(-6 - i) % 5], d[(-5 - i) % 5],
+			d[(-4 - i) % 5], d[(-3 - i) % 5], d[(-2 - i) % 5]);
 
 	/* Round 3 */
-	T_40_59(40, A, B, C, D, E);
-	T_40_59(41, E, A, B, C, D);
-	T_40_59(42, D, E, A, B, C);
-	T_40_59(43, C, D, E, A, B);
-	T_40_59(44, B, C, D, E, A);
-	T_40_59(45, A, B, C, D, E);
-	T_40_59(46, E, A, B, C, D);
-	T_40_59(47, D, E, A, B, C);
-	T_40_59(48, C, D, E, A, B);
-	T_40_59(49, B, C, D, E, A);
-	T_40_59(50, A, B, C, D, E);
-	T_40_59(51, E, A, B, C, D);
-	T_40_59(52, D, E, A, B, C);
-	T_40_59(53, C, D, E, A, B);
-	T_40_59(54, B, C, D, E, A);
-	T_40_59(55, A, B, C, D, E);
-	T_40_59(56, E, A, B, C, D);
-	T_40_59(57, D, E, A, B, C);
-	T_40_59(58, C, D, E, A, B);
-	T_40_59(59, B, C, D, E, A);
+	for (; i < 60; ++i)
+		T_40_59(i, d[(-6 - i) % 5], d[(-5 - i) % 5],
+			d[(-4 - i) % 5], d[(-3 - i) % 5], d[(-2 - i) % 5]);
 
 	/* Round 4 */
-	T_60_79(60, A, B, C, D, E);
-	T_60_79(61, E, A, B, C, D);
-	T_60_79(62, D, E, A, B, C);
-	T_60_79(63, C, D, E, A, B);
-	T_60_79(64, B, C, D, E, A);
-	T_60_79(65, A, B, C, D, E);
-	T_60_79(66, E, A, B, C, D);
-	T_60_79(67, D, E, A, B, C);
-	T_60_79(68, C, D, E, A, B);
-	T_60_79(69, B, C, D, E, A);
-	T_60_79(70, A, B, C, D, E);
-	T_60_79(71, E, A, B, C, D);
-	T_60_79(72, D, E, A, B, C);
-	T_60_79(73, C, D, E, A, B);
-	T_60_79(74, B, C, D, E, A);
-	T_60_79(75, A, B, C, D, E);
-	T_60_79(76, E, A, B, C, D);
-	T_60_79(77, D, E, A, B, C);
-	T_60_79(78, C, D, E, A, B);
-	T_60_79(79, B, C, D, E, A);
-
-	digest[0] += A;
-	digest[1] += B;
-	digest[2] += C;
-	digest[3] += D;
-	digest[4] += E;
+	for (; i < 80; ++i)
+		T_60_79(i, d[(-6 - i) % 5], d[(-5 - i) % 5],
+			d[(-4 - i) % 5], d[(-3 - i) % 5], d[(-2 - i) % 5]);
+
+	digest[0] += d[0];
+	digest[1] += d[1];
+	digest[2] += d[2];
+	digest[3] += d[3];
+	digest[4] += d[4];
 }
 EXPORT_SYMBOL(sha1_transform);
 
-- 
2.34.1


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH crypto v3 0/2] reduce code size from blake2s on m68k and other small platforms
  2022-01-11 18:10   ` [PATCH crypto v2 0/2] reduce code size from blake2s on m68k and other small platforms Jason A. Donenfeld
  2022-01-11 18:10     ` [PATCH crypto v2 1/2] lib/crypto: blake2s: move hmac construction into wireguard Jason A. Donenfeld
  2022-01-11 18:10     ` [PATCH crypto v2 2/2] lib/crypto: sha1: re-roll loops to reduce code size Jason A. Donenfeld
@ 2022-01-11 22:05     ` Jason A. Donenfeld
  2022-01-11 22:05       ` [PATCH crypto v3 1/2] lib/crypto: blake2s: move hmac construction into wireguard Jason A. Donenfeld
                         ` (2 more replies)
  2 siblings, 3 replies; 23+ messages in thread
From: Jason A. Donenfeld @ 2022-01-11 22:05 UTC (permalink / raw)
  To: linux-crypto, netdev, wireguard, linux-kernel, bpf, geert, tytso,
	gregkh, jeanphilippe.aumasson, ardb
  Cc: Jason A. Donenfeld

Hi,

Geert emailed me this afternoon concerned about blake2s codesize on m68k
and other small systems. We identified two effective ways of chopping
down the size. One of them moves some wireguard-specific things into
wireguard proper. The other one adds a slower codepath for small
machines to blake2s. This worked, and was v1 of this patchset, but I
wasn't so much of a fan. Then someone pointed out that the generic C
SHA-1 implementation is still unrolled, which is a *lot* of extra code.
Simply rerolling that saves about as much as v1 did. So, we instead do
that in this patchset. SHA-1 is being phased out, and soon it won't
be included at all (hopefully). And nothing performance-oriented has
anything to do with it anyway.

The result of these two patches mitigates Geert's feared code size
increase for 5.17.

v3 improves on v2 by making the re-rolling of SHA-1 much simpler,
resulting in even larger code size reduction and much better
performance. The reason I'm sending yet a third version in such a short
amount of time is because the trick here feels obvious and substantial
enough that I'd hate for Geert to waste time measuring the impact of the
previous commit.

Thanks,
Jason

Jason A. Donenfeld (2):
  lib/crypto: blake2s: move hmac construction into wireguard
  lib/crypto: sha1: re-roll loops to reduce code size

 drivers/net/wireguard/noise.c | 45 ++++++++++++++---
 include/crypto/blake2s.h      |  3 --
 lib/crypto/blake2s-selftest.c | 31 ------------
 lib/crypto/blake2s.c          | 37 --------------
 lib/sha1.c                    | 95 ++++++-----------------------------
 5 files changed, 53 insertions(+), 158 deletions(-)

-- 
2.34.1


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH crypto v3 1/2] lib/crypto: blake2s: move hmac construction into wireguard
  2022-01-11 22:05     ` [PATCH crypto v3 0/2] reduce code size from blake2s on m68k and other small platforms Jason A. Donenfeld
@ 2022-01-11 22:05       ` Jason A. Donenfeld
  2022-01-11 22:05       ` [PATCH crypto v3 2/2] lib/crypto: sha1: re-roll loops to reduce code size Jason A. Donenfeld
  2022-01-12 10:59       ` [PATCH crypto v3 0/2] reduce code size from blake2s on m68k and other small platforms Geert Uytterhoeven
  2 siblings, 0 replies; 23+ messages in thread
From: Jason A. Donenfeld @ 2022-01-11 22:05 UTC (permalink / raw)
  To: linux-crypto, netdev, wireguard, linux-kernel, bpf, geert, tytso,
	gregkh, jeanphilippe.aumasson, ardb
  Cc: Jason A. Donenfeld, Herbert Xu

Basically nobody should use blake2s in an HMAC construction; it already
has a keyed variant. But unfortunately for historical reasons, Noise,
used by WireGuard, uses HKDF quite strictly, which means we have to use
this. Because this really shouldn't be used by others, this commit moves
it into wireguard's noise.c locally, so that kernels that aren't using
WireGuard don't get this superfluous code baked in. On m68k systems,
this shaves off ~314 bytes.

Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
---
 drivers/net/wireguard/noise.c | 45 ++++++++++++++++++++++++++++++-----
 include/crypto/blake2s.h      |  3 ---
 lib/crypto/blake2s-selftest.c | 31 ------------------------
 lib/crypto/blake2s.c          | 37 ----------------------------
 4 files changed, 39 insertions(+), 77 deletions(-)

diff --git a/drivers/net/wireguard/noise.c b/drivers/net/wireguard/noise.c
index c0cfd9b36c0b..720952b92e78 100644
--- a/drivers/net/wireguard/noise.c
+++ b/drivers/net/wireguard/noise.c
@@ -302,6 +302,41 @@ void wg_noise_set_static_identity_private_key(
 		static_identity->static_public, private_key);
 }
 
+static void hmac(u8 *out, const u8 *in, const u8 *key, const size_t inlen, const size_t keylen)
+{
+	struct blake2s_state state;
+	u8 x_key[BLAKE2S_BLOCK_SIZE] __aligned(__alignof__(u32)) = { 0 };
+	u8 i_hash[BLAKE2S_HASH_SIZE] __aligned(__alignof__(u32));
+	int i;
+
+	if (keylen > BLAKE2S_BLOCK_SIZE) {
+		blake2s_init(&state, BLAKE2S_HASH_SIZE);
+		blake2s_update(&state, key, keylen);
+		blake2s_final(&state, x_key);
+	} else
+		memcpy(x_key, key, keylen);
+
+	for (i = 0; i < BLAKE2S_BLOCK_SIZE; ++i)
+		x_key[i] ^= 0x36;
+
+	blake2s_init(&state, BLAKE2S_HASH_SIZE);
+	blake2s_update(&state, x_key, BLAKE2S_BLOCK_SIZE);
+	blake2s_update(&state, in, inlen);
+	blake2s_final(&state, i_hash);
+
+	for (i = 0; i < BLAKE2S_BLOCK_SIZE; ++i)
+		x_key[i] ^= 0x5c ^ 0x36;
+
+	blake2s_init(&state, BLAKE2S_HASH_SIZE);
+	blake2s_update(&state, x_key, BLAKE2S_BLOCK_SIZE);
+	blake2s_update(&state, i_hash, BLAKE2S_HASH_SIZE);
+	blake2s_final(&state, i_hash);
+
+	memcpy(out, i_hash, BLAKE2S_HASH_SIZE);
+	memzero_explicit(x_key, BLAKE2S_BLOCK_SIZE);
+	memzero_explicit(i_hash, BLAKE2S_HASH_SIZE);
+}
+
 /* This is Hugo Krawczyk's HKDF:
  *  - https://eprint.iacr.org/2010/264.pdf
  *  - https://tools.ietf.org/html/rfc5869
@@ -322,14 +357,14 @@ static void kdf(u8 *first_dst, u8 *second_dst, u8 *third_dst, const u8 *data,
 		 ((third_len || third_dst) && (!second_len || !second_dst))));
 
 	/* Extract entropy from data into secret */
-	blake2s256_hmac(secret, data, chaining_key, data_len, NOISE_HASH_LEN);
+	hmac(secret, data, chaining_key, data_len, NOISE_HASH_LEN);
 
 	if (!first_dst || !first_len)
 		goto out;
 
 	/* Expand first key: key = secret, data = 0x1 */
 	output[0] = 1;
-	blake2s256_hmac(output, output, secret, 1, BLAKE2S_HASH_SIZE);
+	hmac(output, output, secret, 1, BLAKE2S_HASH_SIZE);
 	memcpy(first_dst, output, first_len);
 
 	if (!second_dst || !second_len)
@@ -337,8 +372,7 @@ static void kdf(u8 *first_dst, u8 *second_dst, u8 *third_dst, const u8 *data,
 
 	/* Expand second key: key = secret, data = first-key || 0x2 */
 	output[BLAKE2S_HASH_SIZE] = 2;
-	blake2s256_hmac(output, output, secret, BLAKE2S_HASH_SIZE + 1,
-			BLAKE2S_HASH_SIZE);
+	hmac(output, output, secret, BLAKE2S_HASH_SIZE + 1, BLAKE2S_HASH_SIZE);
 	memcpy(second_dst, output, second_len);
 
 	if (!third_dst || !third_len)
@@ -346,8 +380,7 @@ static void kdf(u8 *first_dst, u8 *second_dst, u8 *third_dst, const u8 *data,
 
 	/* Expand third key: key = secret, data = second-key || 0x3 */
 	output[BLAKE2S_HASH_SIZE] = 3;
-	blake2s256_hmac(output, output, secret, BLAKE2S_HASH_SIZE + 1,
-			BLAKE2S_HASH_SIZE);
+	hmac(output, output, secret, BLAKE2S_HASH_SIZE + 1, BLAKE2S_HASH_SIZE);
 	memcpy(third_dst, output, third_len);
 
 out:
diff --git a/include/crypto/blake2s.h b/include/crypto/blake2s.h
index bc3fb59442ce..4e30e1799e61 100644
--- a/include/crypto/blake2s.h
+++ b/include/crypto/blake2s.h
@@ -101,7 +101,4 @@ static inline void blake2s(u8 *out, const u8 *in, const u8 *key,
 	blake2s_final(&state, out);
 }
 
-void blake2s256_hmac(u8 *out, const u8 *in, const u8 *key, const size_t inlen,
-		     const size_t keylen);
-
 #endif /* _CRYPTO_BLAKE2S_H */
diff --git a/lib/crypto/blake2s-selftest.c b/lib/crypto/blake2s-selftest.c
index 5d9ea53be973..409e4b728770 100644
--- a/lib/crypto/blake2s-selftest.c
+++ b/lib/crypto/blake2s-selftest.c
@@ -15,7 +15,6 @@
  * #include <stdio.h>
  *
  * #include <openssl/evp.h>
- * #include <openssl/hmac.h>
  *
  * #define BLAKE2S_TESTVEC_COUNT	256
  *
@@ -58,16 +57,6 @@
  *	}
  *	printf("};\n\n");
  *
- *	printf("static const u8 blake2s_hmac_testvecs[][BLAKE2S_HASH_SIZE] __initconst = {\n");
- *
- *	HMAC(EVP_blake2s256(), key, sizeof(key), buf, sizeof(buf), hash, NULL);
- *	print_vec(hash, BLAKE2S_OUTBYTES);
- *
- *	HMAC(EVP_blake2s256(), buf, sizeof(buf), key, sizeof(key), hash, NULL);
- *	print_vec(hash, BLAKE2S_OUTBYTES);
- *
- *	printf("};\n");
- *
  *	return 0;
  *}
  */
@@ -554,15 +543,6 @@ static const u8 blake2s_testvecs[][BLAKE2S_HASH_SIZE] __initconst = {
     0xd6, 0x98, 0x6b, 0x07, 0x10, 0x65, 0x52, 0x65, },
 };
 
-static const u8 blake2s_hmac_testvecs[][BLAKE2S_HASH_SIZE] __initconst = {
-  { 0xce, 0xe1, 0x57, 0x69, 0x82, 0xdc, 0xbf, 0x43, 0xad, 0x56, 0x4c, 0x70,
-    0xed, 0x68, 0x16, 0x96, 0xcf, 0xa4, 0x73, 0xe8, 0xe8, 0xfc, 0x32, 0x79,
-    0x08, 0x0a, 0x75, 0x82, 0xda, 0x3f, 0x05, 0x11, },
-  { 0x77, 0x2f, 0x0c, 0x71, 0x41, 0xf4, 0x4b, 0x2b, 0xb3, 0xc6, 0xb6, 0xf9,
-    0x60, 0xde, 0xe4, 0x52, 0x38, 0x66, 0xe8, 0xbf, 0x9b, 0x96, 0xc4, 0x9f,
-    0x60, 0xd9, 0x24, 0x37, 0x99, 0xd6, 0xec, 0x31, },
-};
-
 bool __init blake2s_selftest(void)
 {
 	u8 key[BLAKE2S_KEY_SIZE];
@@ -607,16 +587,5 @@ bool __init blake2s_selftest(void)
 		}
 	}
 
-	if (success) {
-		blake2s256_hmac(hash, buf, key, sizeof(buf), sizeof(key));
-		success &= !memcmp(hash, blake2s_hmac_testvecs[0], BLAKE2S_HASH_SIZE);
-
-		blake2s256_hmac(hash, key, buf, sizeof(key), sizeof(buf));
-		success &= !memcmp(hash, blake2s_hmac_testvecs[1], BLAKE2S_HASH_SIZE);
-
-		if (!success)
-			pr_err("blake2s256_hmac self-test: FAIL\n");
-	}
-
 	return success;
 }
diff --git a/lib/crypto/blake2s.c b/lib/crypto/blake2s.c
index 93f2ae051370..9364f79937b8 100644
--- a/lib/crypto/blake2s.c
+++ b/lib/crypto/blake2s.c
@@ -30,43 +30,6 @@ void blake2s_final(struct blake2s_state *state, u8 *out)
 }
 EXPORT_SYMBOL(blake2s_final);
 
-void blake2s256_hmac(u8 *out, const u8 *in, const u8 *key, const size_t inlen,
-		     const size_t keylen)
-{
-	struct blake2s_state state;
-	u8 x_key[BLAKE2S_BLOCK_SIZE] __aligned(__alignof__(u32)) = { 0 };
-	u8 i_hash[BLAKE2S_HASH_SIZE] __aligned(__alignof__(u32));
-	int i;
-
-	if (keylen > BLAKE2S_BLOCK_SIZE) {
-		blake2s_init(&state, BLAKE2S_HASH_SIZE);
-		blake2s_update(&state, key, keylen);
-		blake2s_final(&state, x_key);
-	} else
-		memcpy(x_key, key, keylen);
-
-	for (i = 0; i < BLAKE2S_BLOCK_SIZE; ++i)
-		x_key[i] ^= 0x36;
-
-	blake2s_init(&state, BLAKE2S_HASH_SIZE);
-	blake2s_update(&state, x_key, BLAKE2S_BLOCK_SIZE);
-	blake2s_update(&state, in, inlen);
-	blake2s_final(&state, i_hash);
-
-	for (i = 0; i < BLAKE2S_BLOCK_SIZE; ++i)
-		x_key[i] ^= 0x5c ^ 0x36;
-
-	blake2s_init(&state, BLAKE2S_HASH_SIZE);
-	blake2s_update(&state, x_key, BLAKE2S_BLOCK_SIZE);
-	blake2s_update(&state, i_hash, BLAKE2S_HASH_SIZE);
-	blake2s_final(&state, i_hash);
-
-	memcpy(out, i_hash, BLAKE2S_HASH_SIZE);
-	memzero_explicit(x_key, BLAKE2S_BLOCK_SIZE);
-	memzero_explicit(i_hash, BLAKE2S_HASH_SIZE);
-}
-EXPORT_SYMBOL(blake2s256_hmac);
-
 static int __init blake2s_mod_init(void)
 {
 	if (!IS_ENABLED(CONFIG_CRYPTO_MANAGER_DISABLE_TESTS) &&
-- 
2.34.1


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH crypto v3 2/2] lib/crypto: sha1: re-roll loops to reduce code size
  2022-01-11 22:05     ` [PATCH crypto v3 0/2] reduce code size from blake2s on m68k and other small platforms Jason A. Donenfeld
  2022-01-11 22:05       ` [PATCH crypto v3 1/2] lib/crypto: blake2s: move hmac construction into wireguard Jason A. Donenfeld
@ 2022-01-11 22:05       ` Jason A. Donenfeld
  2022-01-12 10:59       ` [PATCH crypto v3 0/2] reduce code size from blake2s on m68k and other small platforms Geert Uytterhoeven
  2 siblings, 0 replies; 23+ messages in thread
From: Jason A. Donenfeld @ 2022-01-11 22:05 UTC (permalink / raw)
  To: linux-crypto, netdev, wireguard, linux-kernel, bpf, geert, tytso,
	gregkh, jeanphilippe.aumasson, ardb
  Cc: Jason A. Donenfeld, Herbert Xu

With SHA-1 no longer being used for anything performance oriented, and
also soon to be phased out entirely, we can make up for the space added
by unrolled BLAKE2s by simply re-rolling SHA-1. Since SHA-1 is so much
more complex, re-rolling it more or less takes care of the code size
added by BLAKE2s. And eventually, hopefully we'll see SHA-1 removed
entirely from most small kernel builds.

Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
---
 lib/sha1.c | 95 ++++++++----------------------------------------------
 1 file changed, 14 insertions(+), 81 deletions(-)

diff --git a/lib/sha1.c b/lib/sha1.c
index 9bd1935a1472..0494766fc574 100644
--- a/lib/sha1.c
+++ b/lib/sha1.c
@@ -9,6 +9,7 @@
 #include <linux/kernel.h>
 #include <linux/export.h>
 #include <linux/bitops.h>
+#include <linux/string.h>
 #include <crypto/sha1.h>
 #include <asm/unaligned.h>
 
@@ -55,7 +56,8 @@
 #define SHA_ROUND(t, input, fn, constant, A, B, C, D, E) do { \
 	__u32 TEMP = input(t); setW(t, TEMP); \
 	E += TEMP + rol32(A,5) + (fn) + (constant); \
-	B = ror32(B, 2); } while (0)
+	B = ror32(B, 2); \
+	TEMP = E; E = D; D = C; C = B; B = A; A = TEMP; } while (0)
 
 #define T_0_15(t, A, B, C, D, E)  SHA_ROUND(t, SHA_SRC, (((C^D)&B)^D) , 0x5a827999, A, B, C, D, E )
 #define T_16_19(t, A, B, C, D, E) SHA_ROUND(t, SHA_MIX, (((C^D)&B)^D) , 0x5a827999, A, B, C, D, E )
@@ -84,6 +86,7 @@
 void sha1_transform(__u32 *digest, const char *data, __u32 *array)
 {
 	__u32 A, B, C, D, E;
+	unsigned int i = 0;
 
 	A = digest[0];
 	B = digest[1];
@@ -92,94 +95,24 @@ void sha1_transform(__u32 *digest, const char *data, __u32 *array)
 	E = digest[4];
 
 	/* Round 1 - iterations 0-16 take their input from 'data' */
-	T_0_15( 0, A, B, C, D, E);
-	T_0_15( 1, E, A, B, C, D);
-	T_0_15( 2, D, E, A, B, C);
-	T_0_15( 3, C, D, E, A, B);
-	T_0_15( 4, B, C, D, E, A);
-	T_0_15( 5, A, B, C, D, E);
-	T_0_15( 6, E, A, B, C, D);
-	T_0_15( 7, D, E, A, B, C);
-	T_0_15( 8, C, D, E, A, B);
-	T_0_15( 9, B, C, D, E, A);
-	T_0_15(10, A, B, C, D, E);
-	T_0_15(11, E, A, B, C, D);
-	T_0_15(12, D, E, A, B, C);
-	T_0_15(13, C, D, E, A, B);
-	T_0_15(14, B, C, D, E, A);
-	T_0_15(15, A, B, C, D, E);
+	for (; i < 16; ++i)
+		T_0_15(i, A, B, C, D, E);
 
 	/* Round 1 - tail. Input from 512-bit mixing array */
-	T_16_19(16, E, A, B, C, D);
-	T_16_19(17, D, E, A, B, C);
-	T_16_19(18, C, D, E, A, B);
-	T_16_19(19, B, C, D, E, A);
+	for (; i < 20; ++i)
+		T_16_19(i, A, B, C, D, E);
 
 	/* Round 2 */
-	T_20_39(20, A, B, C, D, E);
-	T_20_39(21, E, A, B, C, D);
-	T_20_39(22, D, E, A, B, C);
-	T_20_39(23, C, D, E, A, B);
-	T_20_39(24, B, C, D, E, A);
-	T_20_39(25, A, B, C, D, E);
-	T_20_39(26, E, A, B, C, D);
-	T_20_39(27, D, E, A, B, C);
-	T_20_39(28, C, D, E, A, B);
-	T_20_39(29, B, C, D, E, A);
-	T_20_39(30, A, B, C, D, E);
-	T_20_39(31, E, A, B, C, D);
-	T_20_39(32, D, E, A, B, C);
-	T_20_39(33, C, D, E, A, B);
-	T_20_39(34, B, C, D, E, A);
-	T_20_39(35, A, B, C, D, E);
-	T_20_39(36, E, A, B, C, D);
-	T_20_39(37, D, E, A, B, C);
-	T_20_39(38, C, D, E, A, B);
-	T_20_39(39, B, C, D, E, A);
+	for (; i < 40; ++i)
+		T_20_39(i, A, B, C, D, E);
 
 	/* Round 3 */
-	T_40_59(40, A, B, C, D, E);
-	T_40_59(41, E, A, B, C, D);
-	T_40_59(42, D, E, A, B, C);
-	T_40_59(43, C, D, E, A, B);
-	T_40_59(44, B, C, D, E, A);
-	T_40_59(45, A, B, C, D, E);
-	T_40_59(46, E, A, B, C, D);
-	T_40_59(47, D, E, A, B, C);
-	T_40_59(48, C, D, E, A, B);
-	T_40_59(49, B, C, D, E, A);
-	T_40_59(50, A, B, C, D, E);
-	T_40_59(51, E, A, B, C, D);
-	T_40_59(52, D, E, A, B, C);
-	T_40_59(53, C, D, E, A, B);
-	T_40_59(54, B, C, D, E, A);
-	T_40_59(55, A, B, C, D, E);
-	T_40_59(56, E, A, B, C, D);
-	T_40_59(57, D, E, A, B, C);
-	T_40_59(58, C, D, E, A, B);
-	T_40_59(59, B, C, D, E, A);
+	for (; i < 60; ++i)
+		T_40_59(i, A, B, C, D, E);
 
 	/* Round 4 */
-	T_60_79(60, A, B, C, D, E);
-	T_60_79(61, E, A, B, C, D);
-	T_60_79(62, D, E, A, B, C);
-	T_60_79(63, C, D, E, A, B);
-	T_60_79(64, B, C, D, E, A);
-	T_60_79(65, A, B, C, D, E);
-	T_60_79(66, E, A, B, C, D);
-	T_60_79(67, D, E, A, B, C);
-	T_60_79(68, C, D, E, A, B);
-	T_60_79(69, B, C, D, E, A);
-	T_60_79(70, A, B, C, D, E);
-	T_60_79(71, E, A, B, C, D);
-	T_60_79(72, D, E, A, B, C);
-	T_60_79(73, C, D, E, A, B);
-	T_60_79(74, B, C, D, E, A);
-	T_60_79(75, A, B, C, D, E);
-	T_60_79(76, E, A, B, C, D);
-	T_60_79(77, D, E, A, B, C);
-	T_60_79(78, C, D, E, A, B);
-	T_60_79(79, B, C, D, E, A);
+	for (; i < 80; ++i)
+		T_60_79(i, A, B, C, D, E);
 
 	digest[0] += A;
 	digest[1] += B;
-- 
2.34.1


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH crypto 1/2] lib/crypto: blake2s-generic: reduce code size on small systems
  2022-01-11 13:49   ` [PATCH crypto 1/2] lib/crypto: blake2s-generic: reduce code size on small systems Jason A. Donenfeld
@ 2022-01-12 10:57     ` Geert Uytterhoeven
  2022-01-12 13:16       ` Jason A. Donenfeld
  2022-01-12 18:31     ` Eric Biggers
  1 sibling, 1 reply; 23+ messages in thread
From: Geert Uytterhoeven @ 2022-01-12 10:57 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Linux Crypto Mailing List, netdev, wireguard,
	Linux Kernel Mailing List, bpf, Theodore Tso, Greg KH,
	jeanphilippe.aumasson, Ard Biesheuvel, Herbert Xu

Hi Jason,

On Tue, Jan 11, 2022 at 2:49 PM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> Re-wind the loops entirely on kernels optimized for code size. This is
> really not good at all performance-wise. But on m68k, it shaves off 4k
> of code size, which is apparently important.

On arm32:

add/remove: 1/0 grow/shrink: 0/1 up/down: 160/-4212 (-4052)
Function                                     old     new   delta
blake2s_sigma                                  -     160    +160
blake2s_compress_generic                    4872     660   -4212
Total: Before=9846148, After=9842096, chg -0.04%

On arm64:

add/remove: 1/2 grow/shrink: 0/1 up/down: 160/-4584 (-4424)
Function                                     old     new   delta
blake2s_sigma                                  -     160    +160
e843419@0710_00007634_e8a0                     8       -      -8
e843419@0441_0000423a_178c                     8       -      -8
blake2s_compress_generic                    5088     520   -4568
Total: Before=32800278, After=32795854, chg -0.01%

> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>

For the size reduction:
Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH crypto v3 0/2] reduce code size from blake2s on m68k and other small platforms
  2022-01-11 22:05     ` [PATCH crypto v3 0/2] reduce code size from blake2s on m68k and other small platforms Jason A. Donenfeld
  2022-01-11 22:05       ` [PATCH crypto v3 1/2] lib/crypto: blake2s: move hmac construction into wireguard Jason A. Donenfeld
  2022-01-11 22:05       ` [PATCH crypto v3 2/2] lib/crypto: sha1: re-roll loops to reduce code size Jason A. Donenfeld
@ 2022-01-12 10:59       ` Geert Uytterhoeven
  2022-01-12 13:18         ` Jason A. Donenfeld
  2 siblings, 1 reply; 23+ messages in thread
From: Geert Uytterhoeven @ 2022-01-12 10:59 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Linux Crypto Mailing List, netdev, wireguard,
	Linux Kernel Mailing List, bpf, Theodore Tso, Greg KH,
	jeanphilippe.aumasson, Ard Biesheuvel

Hi Jason,

On Tue, Jan 11, 2022 at 11:05 PM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> Geert emailed me this afternoon concerned about blake2s codesize on m68k
> and other small systems. We identified two effective ways of chopping
> down the size. One of them moves some wireguard-specific things into
> wireguard proper. The other one adds a slower codepath for small
> machines to blake2s. This worked, and was v1 of this patchset, but I
> wasn't so much of a fan. Then someone pointed out that the generic C
> SHA-1 implementation is still unrolled, which is a *lot* of extra code.
> Simply rerolling that saves about as much as v1 did. So, we instead do
> that in this patchset. SHA-1 is being phased out, and soon it won't
> be included at all (hopefully). And nothing performance-oriented has
> anything to do with it anyway.
>
> The result of these two patches mitigates Geert's feared code size
> increase for 5.17.
>
> v3 improves on v2 by making the re-rolling of SHA-1 much simpler,
> resulting in even larger code size reduction and much better
> performance. The reason I'm sending yet a third version in such a short
> amount of time is because the trick here feels obvious and substantial
> enough that I'd hate for Geert to waste time measuring the impact of the
> previous commit.
>
> Thanks,
> Jason
>
> Jason A. Donenfeld (2):
>   lib/crypto: blake2s: move hmac construction into wireguard
>   lib/crypto: sha1: re-roll loops to reduce code size

Thanks for the series!

On m68k:
add/remove: 1/4 grow/shrink: 0/1 up/down: 4/-4232 (-4228)
Function                                     old     new   delta
__ksymtab_blake2s256_hmac                     12       -     -12
blake2s_init.constprop                        94       -     -94
blake2s256_hmac                              302       -    -302
sha1_transform                              4402     582   -3820
Total: Before=4230537, After=4226309, chg -0.10%

Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH crypto 1/2] lib/crypto: blake2s-generic: reduce code size on small systems
  2022-01-12 10:57     ` Geert Uytterhoeven
@ 2022-01-12 13:16       ` Jason A. Donenfeld
  0 siblings, 0 replies; 23+ messages in thread
From: Jason A. Donenfeld @ 2022-01-12 13:16 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Linux Crypto Mailing List, netdev, WireGuard mailing list,
	Linux Kernel Mailing List, bpf, Theodore Tso, Greg KH,
	Jean-Philippe Aumasson, Ard Biesheuvel, Herbert Xu

Hi Geert,

Thanks for testing this. However, I've *abandoned* this patch, due to
unacceptable performance hits, and figuring out that we can accomplish
basically the same thing without as large of a hit by modifying the
obsolete sha1 implementation.

Herbert - please do not apply this patch. Instead, later versions of
this patchset (e.g. v3 [1] and potentially later if it comes to that)
are what should be applied.

Jason

[1] https://lore.kernel.org/linux-crypto/20220111220506.742067-1-Jason@zx2c4.com/

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH crypto v3 0/2] reduce code size from blake2s on m68k and other small platforms
  2022-01-12 10:59       ` [PATCH crypto v3 0/2] reduce code size from blake2s on m68k and other small platforms Geert Uytterhoeven
@ 2022-01-12 13:18         ` Jason A. Donenfeld
  2022-01-18  6:42           ` Herbert Xu
  0 siblings, 1 reply; 23+ messages in thread
From: Jason A. Donenfeld @ 2022-01-12 13:18 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Linux Crypto Mailing List, netdev, WireGuard mailing list,
	Linux Kernel Mailing List, bpf, Theodore Tso, Greg KH,
	Jean-Philippe Aumasson, Ard Biesheuvel, Herbert Xu

Hi Geert,

On Wed, Jan 12, 2022 at 12:00 PM Geert Uytterhoeven
<geert@linux-m68k.org> wrote:
> Thanks for the series!
>
> On m68k:
> add/remove: 1/4 grow/shrink: 0/1 up/down: 4/-4232 (-4228)
> Function                                     old     new   delta
> __ksymtab_blake2s256_hmac                     12       -     -12
> blake2s_init.constprop                        94       -     -94
> blake2s256_hmac                              302       -    -302
> sha1_transform                              4402     582   -3820
> Total: Before=4230537, After=4226309, chg -0.10%
>
> Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>

Excellent, thanks for the breakdown. So this shaves off ~4k, which was
about what we were shooting for here, so I think indeed this series
accomplishes its goal of counteracting the addition of BLAKE2s.
Hopefully Herbert will apply this series for 5.17.

Jason

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH crypto 1/2] lib/crypto: blake2s-generic: reduce code size on small systems
  2022-01-11 13:49   ` [PATCH crypto 1/2] lib/crypto: blake2s-generic: reduce code size on small systems Jason A. Donenfeld
  2022-01-12 10:57     ` Geert Uytterhoeven
@ 2022-01-12 18:31     ` Eric Biggers
  2022-01-12 18:50       ` Jason A. Donenfeld
  1 sibling, 1 reply; 23+ messages in thread
From: Eric Biggers @ 2022-01-12 18:31 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: linux-crypto, netdev, wireguard, linux-kernel, bpf, geert, tytso,
	gregkh, jeanphilippe.aumasson, ardb, Herbert Xu

On Tue, Jan 11, 2022 at 02:49:33PM +0100, Jason A. Donenfeld wrote:
> Re-wind the loops entirely on kernels optimized for code size. This is
> really not good at all performance-wise. But on m68k, it shaves off 4k
> of code size, which is apparently important.
> 
> Cc: Geert Uytterhoeven <geert@linux-m68k.org>
> Cc: Herbert Xu <herbert@gondor.apana.org.au>
> Cc: Ard Biesheuvel <ardb@kernel.org>
> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
> ---
>  lib/crypto/blake2s-generic.c | 30 ++++++++++++++++++------------
>  1 file changed, 18 insertions(+), 12 deletions(-)
> 
> diff --git a/lib/crypto/blake2s-generic.c b/lib/crypto/blake2s-generic.c
> index 75ccb3e633e6..990f000e22ee 100644
> --- a/lib/crypto/blake2s-generic.c
> +++ b/lib/crypto/blake2s-generic.c
> @@ -46,7 +46,7 @@ void blake2s_compress_generic(struct blake2s_state *state, const u8 *block,
>  {
>  	u32 m[16];
>  	u32 v[16];
> -	int i;
> +	int i, j;
>  
>  	WARN_ON(IS_ENABLED(DEBUG) &&
>  		(nblocks > 1 && inc != BLAKE2S_BLOCK_SIZE));
> @@ -86,17 +86,23 @@ void blake2s_compress_generic(struct blake2s_state *state, const u8 *block,
>  	G(r, 6, v[2], v[ 7], v[ 8], v[13]); \
>  	G(r, 7, v[3], v[ 4], v[ 9], v[14]); \
>  } while (0)
> -		ROUND(0);
> -		ROUND(1);
> -		ROUND(2);
> -		ROUND(3);
> -		ROUND(4);
> -		ROUND(5);
> -		ROUND(6);
> -		ROUND(7);
> -		ROUND(8);
> -		ROUND(9);
> -
> +		if (IS_ENABLED(CONFIG_CC_OPTIMIZE_FOR_SIZE)) {
> +			for (i = 0; i < 10; ++i) {
> +				for (j = 0; j < 8; ++j)
> +					G(i, j, v[j % 4], v[((j + (j / 4)) % 4) + 4], v[((j + 2 * (j / 4)) % 4) + 8], v[((j + 3 * (j / 4)) % 4) + 12]);
> +			}

How about unrolling the inner loop but not the outer one?  Wouldn't that give
most of the benefit, without hurting performance as much?

If you stay with this approach and don't unroll either loop, can you use 'r' and
'i' instead of 'i' and 'j', to match the naming in G()?

Also, please wrap lines at 80 columns.

- Eric

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH crypto 2/2] lib/crypto: blake2s: move hmac construction into wireguard
  2022-01-11 13:49   ` [PATCH crypto 2/2] lib/crypto: blake2s: move hmac construction into wireguard Jason A. Donenfeld
  2022-01-11 14:43     ` Ard Biesheuvel
@ 2022-01-12 18:35     ` Eric Biggers
  1 sibling, 0 replies; 23+ messages in thread
From: Eric Biggers @ 2022-01-12 18:35 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: linux-crypto, netdev, wireguard, linux-kernel, bpf, geert, tytso,
	gregkh, jeanphilippe.aumasson, ardb, Herbert Xu

On Tue, Jan 11, 2022 at 02:49:34PM +0100, Jason A. Donenfeld wrote:
> Basically nobody should use blake2s in an HMAC construction; it already
> has a keyed variant. But for unfortunately historical reasons, Noise,
> used by WireGuard, uses HKDF quite strictly, which means we have to use
> this. Because this really shouldn't be used by others, this commit moves
> it into wireguard's noise.c locally, so that kernels that aren't using
> WireGuard don't get this superfluous code baked in. On m68k systems,
> this shaves off ~314 bytes.
> 
> Cc: Geert Uytterhoeven <geert@linux-m68k.org>
> Cc: Herbert Xu <herbert@gondor.apana.org.au>
> Cc: Ard Biesheuvel <ardb@kernel.org>
> Cc: netdev@vger.kernel.org
> Cc: wireguard@lists.zx2c4.com
> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
> ---

Reviewed-by: Eric Biggers <ebiggers@google.com>

- Eric

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH crypto 1/2] lib/crypto: blake2s-generic: reduce code size on small systems
  2022-01-12 18:31     ` Eric Biggers
@ 2022-01-12 18:50       ` Jason A. Donenfeld
  2022-01-12 21:27         ` David Laight
  0 siblings, 1 reply; 23+ messages in thread
From: Jason A. Donenfeld @ 2022-01-12 18:50 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Linux Crypto Mailing List, Netdev, WireGuard mailing list, LKML,
	bpf, Geert Uytterhoeven, Theodore Ts'o, Greg Kroah-Hartman,
	Jean-Philippe Aumasson, Ard Biesheuvel, Herbert Xu

On Wed, Jan 12, 2022 at 7:32 PM Eric Biggers <ebiggers@kernel.org> wrote:
> How about unrolling the inner loop but not the outer one?  Wouldn't that give
> most of the benefit, without hurting performance as much?
>
> If you stay with this approach and don't unroll either loop, can you use 'r' and
> 'i' instead of 'i' and 'j', to match the naming in G()?

All this might work, sure. But as mentioned earlier, I've abandoned
this entirely, as I don't think this patch is necessary. See the v3
patchset instead:

https://lore.kernel.org/linux-crypto/20220111220506.742067-1-Jason@zx2c4.com/

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: [PATCH crypto 1/2] lib/crypto: blake2s-generic: reduce code size on small systems
  2022-01-12 18:50       ` Jason A. Donenfeld
@ 2022-01-12 21:27         ` David Laight
  2022-01-12 22:00           ` Jason A. Donenfeld
  0 siblings, 1 reply; 23+ messages in thread
From: David Laight @ 2022-01-12 21:27 UTC (permalink / raw)
  To: 'Jason A. Donenfeld', Eric Biggers
  Cc: Linux Crypto Mailing List, Netdev, WireGuard mailing list, LKML,
	bpf, Geert Uytterhoeven, Theodore Ts'o, Greg Kroah-Hartman,
	Jean-Philippe Aumasson, Ard Biesheuvel, Herbert Xu

From: Jason A. Donenfeld
> Sent: 12 January 2022 18:51
> 
> On Wed, Jan 12, 2022 at 7:32 PM Eric Biggers <ebiggers@kernel.org> wrote:
> > How about unrolling the inner loop but not the outer one?  Wouldn't that give
> > most of the benefit, without hurting performance as much?
> >
> > If you stay with this approach and don't unroll either loop, can you use 'r' and
> > 'i' instead of 'i' and 'j', to match the naming in G()?
> 
> All this might work, sure. But as mentioned earlier, I've abandoned
> this entirely, as I don't think this patch is necessary. See the v3
> patchset instead:
> 
> https://lore.kernel.org/linux-crypto/20220111220506.742067-1-Jason@zx2c4.com/

I think you mentioned in another thread that the buffers (eg for IPv6
addresses) are actually often quite short.

For short buffers the 'rolled-up' loop may be of similar performance
to the unrolled one because of the time taken to read all the instructions
into the I-cache and decode them.
If the loop ends up small enough it will fit into the 'decoded loop
buffer' of modern Intel x86 cpu and won't even need decoding on
each iteration.

I really suspect that the heavily unrolled loop is only really fast
for big buffers and/or when it is already in the I-cache.
In real life I wonder how often that actually happens?
Especially for the uses the kernel is making of the code.

You need to benchmark single executions of the function
(doable on x86 with the performance monitor cycle counter)
to get typical/best clocks/byte figures rather than a
big average for repeated operation on a long buffer.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH crypto 1/2] lib/crypto: blake2s-generic: reduce code size on small systems
  2022-01-12 21:27         ` David Laight
@ 2022-01-12 22:00           ` Jason A. Donenfeld
  0 siblings, 0 replies; 23+ messages in thread
From: Jason A. Donenfeld @ 2022-01-12 22:00 UTC (permalink / raw)
  To: David Laight
  Cc: Eric Biggers, Linux Crypto Mailing List, Netdev,
	WireGuard mailing list, LKML, bpf, Geert Uytterhoeven,
	Theodore Ts'o, Greg Kroah-Hartman, Jean-Philippe Aumasson,
	Ard Biesheuvel, Herbert Xu

Hi David,

On 1/12/22, David Laight <David.Laight@aculab.com> wrote:
> I think you mentioned in another thread that the buffers (eg for IPv6
> addresses) are actually often quite short.
>
> For short buffers the 'rolled-up' loop may be of similar performance
> to the unrolled one because of the time taken to read all the instructions
> into the I-cache and decode them.
> If the loop ends up small enough it will fit into the 'decoded loop
> buffer' of modern Intel x86 cpu and won't even need decoding on
> each iteration.
>
> I really suspect that the heavily unrolled loop is only really fast
> for big buffers and/or when it is already in the I-cache.
> In real life I wonder how often that actually happens?
> Especially for the uses the kernel is making of the code.
>
> You need to benchmark single executions of the function
> (doable on x86 with the performance monitor cycle counter)
> to get typical/best clocks/byte figures rather than a
> big average for repeated operation on a long buffer.
>
> 	David

This patch has been dropped entirely from future revisions. The latest
as of writing is at:

https://lore.kernel.org/linux-crypto/20220111220506.742067-1-Jason@zx2c4.com/

If you'd like to do something with blake2s, by all means submit a
patch and include various rationale and metrics and benchmarks. I do
not intend to do that myself and do not think my particular patch here
should be merged. But if you'd like to do something, feel free to CC
me for a review. However, as mentioned, I don't think much needs to be
done here.

Again, v3 is here:
https://lore.kernel.org/linux-crypto/20220111220506.742067-1-Jason@zx2c4.com/

Thanks,
Jason

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH crypto v3 0/2] reduce code size from blake2s on m68k and other small platforms
  2022-01-12 13:18         ` Jason A. Donenfeld
@ 2022-01-18  6:42           ` Herbert Xu
  2022-01-18 11:43             ` Jason A. Donenfeld
  0 siblings, 1 reply; 23+ messages in thread
From: Herbert Xu @ 2022-01-18  6:42 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: geert, linux-crypto, netdev, wireguard, linux-kernel, bpf, tytso,
	gregkh, jeanphilippe.aumasson, ardb

Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
> Excellent, thanks for the breakdown. So this shaves off ~4k, which was
> about what we were shooting for here, so I think indeed this series
> accomplishes its goal of counteracting the addition of BLAKE2s.
> Hopefully Herbert will apply this series for 5.17.

As the patches that triggered this weren't part of the crypto
tree, this will have to go through the random tree if you want
them for 5.17.

Otherwise if you're happy to wait then I can pull them through
cryptodev.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH crypto v3 0/2] reduce code size from blake2s on m68k and other small platforms
  2022-01-18  6:42           ` Herbert Xu
@ 2022-01-18 11:43             ` Jason A. Donenfeld
  2022-01-18 12:44               ` David Laight
  0 siblings, 1 reply; 23+ messages in thread
From: Jason A. Donenfeld @ 2022-01-18 11:43 UTC (permalink / raw)
  To: Herbert Xu
  Cc: geert, linux-crypto, netdev, wireguard, linux-kernel, bpf, tytso,
	gregkh, jeanphilippe.aumasson, ardb

On 1/18/22, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> As the patches that triggered this weren't part of the crypto
> tree, this will have to go through the random tree if you want
> them for 5.17.

Sure, will do.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: [PATCH crypto v3 0/2] reduce code size from blake2s on m68k and other small platforms
  2022-01-18 11:43             ` Jason A. Donenfeld
@ 2022-01-18 12:44               ` David Laight
  2022-01-18 12:50                 ` Jason A. Donenfeld
  0 siblings, 1 reply; 23+ messages in thread
From: David Laight @ 2022-01-18 12:44 UTC (permalink / raw)
  To: 'Jason A. Donenfeld', Herbert Xu
  Cc: geert, linux-crypto, netdev, wireguard, linux-kernel, bpf, tytso,
	gregkh, jeanphilippe.aumasson, ardb

From: Jason A. Donenfeld
> Sent: 18 January 2022 11:43
> 
> On 1/18/22, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> > As the patches that triggered this weren't part of the crypto
> > tree, this will have to go through the random tree if you want
> > them for 5.17.
> 
> Sure, will do.

I've rammed the code through godbolt... https://godbolt.org/z/Wv64z9zG8

Some things I've noticed;

1) There is no point having all the inline functions.
   Far better to have real functions to do the work.
   Given the cost of hashing 64 bytes of data the extra
   function call won't matter.
   Indeed for repeated calls it will help because the required
   code will be in the I-cache.

2) The compiles I tried do manage to remove the blake2_sigma[][]
   when unrolling everything - which is a slight gain for the full
   unroll. But I doubt it is that significant if the access can
   get sensibly optimised.
   For non-x86 that might require all the values by multiplied by 4.

3) Although G() is a massive register dependency chain the compiler
   knows that G(,[0-3],) are independent and can execute in parallel.
   This does help execution time on multi-issue cpu (like x86).
   With care it ought to be possible to use the same code for G(,[4-7],)
   without stopping the compiler interleaving all the instructions.

4) I strongly suspect that using a loop for the rounds will have
   minimal impact on performance - especially if the first call is
   'cold cache'.
   But I've not got time to test the code.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH crypto v3 0/2] reduce code size from blake2s on m68k and other small platforms
  2022-01-18 12:44               ` David Laight
@ 2022-01-18 12:50                 ` Jason A. Donenfeld
  0 siblings, 0 replies; 23+ messages in thread
From: Jason A. Donenfeld @ 2022-01-18 12:50 UTC (permalink / raw)
  To: David Laight
  Cc: Herbert Xu, geert, linux-crypto, netdev, wireguard, linux-kernel,
	bpf, tytso, gregkh, jeanphilippe.aumasson, ardb

On Tue, Jan 18, 2022 at 1:45 PM David Laight <David.Laight@aculab.com> wrote:
> I've rammed the code through godbolt... https://godbolt.org/z/Wv64z9zG8
>
> Some things I've noticed;

It seems like you've done a lot of work here but...

>    But I've not got time to test the code.

But you're not going to take it all the way. So it unfortunately
amounts to mailing list armchair optimization. That's too bad because
it really seems like you might be onto something worth seeing through.
As I've mentioned a few times now, I've dropped the blake2s
optimization patch, and I won't be developing that further. But it
appears as though you've really been captured by it, so I urge you:
please send a real patch with benchmarks on various platforms! (And CC
me on the patch.) Faster reference code would really be terrific.

Jason

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2022-01-25 16:09 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CAHmME9qbnYmhvsuarButi6s=58=FPiti0Z-QnGMJ=OsMzy1eOg@mail.gmail.com>
2022-01-11 13:49 ` [PATCH crypto 0/2] smaller blake2s code size on m68k and other small platforms Jason A. Donenfeld
2022-01-11 13:49   ` [PATCH crypto 1/2] lib/crypto: blake2s-generic: reduce code size on small systems Jason A. Donenfeld
2022-01-12 10:57     ` Geert Uytterhoeven
2022-01-12 13:16       ` Jason A. Donenfeld
2022-01-12 18:31     ` Eric Biggers
2022-01-12 18:50       ` Jason A. Donenfeld
2022-01-12 21:27         ` David Laight
2022-01-12 22:00           ` Jason A. Donenfeld
2022-01-11 13:49   ` [PATCH crypto 2/2] lib/crypto: blake2s: move hmac construction into wireguard Jason A. Donenfeld
2022-01-11 14:43     ` Ard Biesheuvel
2022-01-12 18:35     ` Eric Biggers
2022-01-11 18:10   ` [PATCH crypto v2 0/2] reduce code size from blake2s on m68k and other small platforms Jason A. Donenfeld
2022-01-11 18:10     ` [PATCH crypto v2 1/2] lib/crypto: blake2s: move hmac construction into wireguard Jason A. Donenfeld
2022-01-11 18:10     ` [PATCH crypto v2 2/2] lib/crypto: sha1: re-roll loops to reduce code size Jason A. Donenfeld
2022-01-11 22:05     ` [PATCH crypto v3 0/2] reduce code size from blake2s on m68k and other small platforms Jason A. Donenfeld
2022-01-11 22:05       ` [PATCH crypto v3 1/2] lib/crypto: blake2s: move hmac construction into wireguard Jason A. Donenfeld
2022-01-11 22:05       ` [PATCH crypto v3 2/2] lib/crypto: sha1: re-roll loops to reduce code size Jason A. Donenfeld
2022-01-12 10:59       ` [PATCH crypto v3 0/2] reduce code size from blake2s on m68k and other small platforms Geert Uytterhoeven
2022-01-12 13:18         ` Jason A. Donenfeld
2022-01-18  6:42           ` Herbert Xu
2022-01-18 11:43             ` Jason A. Donenfeld
2022-01-18 12:44               ` David Laight
2022-01-18 12:50                 ` Jason A. Donenfeld

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).