From mboxrd@z Thu Jan  1 00:00:00 1970
Content-Type: text/plain;
	charset=us-ascii
Mime-Version: 1.0 (Mac OS X Mail 12.4 \(3445.104.21\))
Subject: Re: (LC_ALL=C; set -x 128 129; printf "%s\n" ${(#)@} | hexdump -C)
From: Jun T <takimoto-j@kba.biglobe.ne.jp>
X-Priority: 3
In-Reply-To: <C3C2FCAB-0D5D-4CD0-8949-75AA9971681E@kba.biglobe.ne.jp>
Date: Thu, 21 Sep 2023 18:26:31 +0900
Content-Transfer-Encoding: quoted-printable
Message-Id: <B84E7C6D-EB70-4610-8E91-E48AFD406357@kba.biglobe.ne.jp>
References: <20230830072753.hhveg7teosubwzq7@chazelas.org>
 <CAH+w=7b58OFubjUOdYuqm0U3hYPKFbX-GSuj7R7H6T8jmtTG2Q@mail.gmail.com>
 <88812889-04BC-412A-85BE-BDAA2326B29B@kba.biglobe.ne.jp>
 <EBF09792-A011-44F6-A36C-8F704A1A1FF8@kba.biglobe.ne.jp>
 <899459233.232418.1694104433053@mail.virginmedia.com>
 <64346084-434A-4A42-AD56-44809DA2E54C@kba.biglobe.ne.jp>
 <968697743.3642134.1694422642580@mail.virginmedia.com>
 <C4E4AB11-41FC-4EA8-8B8A-C531241E2429@kba.biglobe.ne.jp>
 <C3C2FCAB-0D5D-4CD0-8949-75AA9971681E@kba.biglobe.ne.jp>
To: zsh-workers@zsh.org
X-Mailer: Apple Mail (2.3445.104.21)
X-Biglobe-Sender: takimoto-j@kba.biglobe.ne.jp
X-Seq: 52169
Archived-At: <https://zsh.org/workers/52169>
X-Loop: zsh-workers@zsh.org
Errors-To: zsh-workers-owner@zsh.org
Precedence: list
Precedence: bulk
Sender: zsh-workers-request@zsh.org
X-no-archive: yes
List-Id: <zsh-workers.zsh.org>
List-Help: <http://www.zsh.org/sympa/help>, <mailto:sympa@zsh.org?subject=HELP>
List-Subscribe: <http://www.zsh.org/sympa/subscribe/zsh-workers>, <mailto:sympa@zsh.org?subject=SUB%20zsh-workers>
List-Unsubscribe: <http://www.zsh.org/sympa/signoff/zsh-workers>, <mailto:sympa@zsh.org?subject=SIG%20zsh-workers>
List-Post: <mailto:zsh-workers@zsh.org>
List-Owner: <mailto:zsh-workers-request@zsh.org>
List-Archive: <http://www.zsh.org/sympa/arc/zsh-workers>


> 2023/09/13 18:59, Jun T <takimoto-j@kba.biglobe.ne.jp> wrote:
>
> the test fails on FreeBSD,
> DragonFly and NetBSD for out-of-range characters.
(snip)
> This is due to the peculiar behavior of iconv(3). It converts
> out-of-range character to '?' (0x3f) with return value 1,

This behavior of iconv(3) is explicitly documented in the manpage,
so we can't call it a bug. Still, I think we should treat a positive
return value of iconv() the same way as -1.

But simply replacing (utils.c:7046)
	if (count == (size_t)-1) {
with
	if (count) {
didn't work, because of complications with errflag/noerrs.
So I moved the conversion code into a new function, ucs4tomb().
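
To illustrate the idea of treating any non-zero return value from
iconv() as a failure, here is a minimal standalone sketch. It is not the
code from the patch: ucs4_to_codeset() and its parameters are
hypothetical names, and it assumes an iconv() whose second argument is
char ** (some systems declare it const char **, which is what
ICONV_CONST handles in zsh).

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper (not part of zsh): convert one UCS-4 code point
 * to the given codeset.  Returns the number of bytes stored in out,
 * or -1 if the conversion fails in any way. */
int
ucs4_to_codeset(const char *codeset, unsigned int wval,
                char *out, size_t outsize)
{
    char inbuf[4], *inptr = inbuf, *outptr = out;
    size_t inbytes = sizeof(inbuf), outbytes = outsize, count;
    iconv_t cd;
    int i;

    assert(codeset != NULL && out != NULL);

    cd = iconv_open(codeset, "UCS-4BE");
    if (cd == (iconv_t)-1)
        return -1;

    /* store value in big endian form */
    for (i = 3; i >= 0; i--) {
        inbuf[i] = (char)(wval & 0xff);
        wval >>= 8;
    }

    count = iconv(cd, &inptr, &inbytes, &outptr, &outbytes);
    iconv_close(cd);

    /* (size_t)-1 is an error; a positive value is the number of
     * non-reversible conversions (e.g. a substituted '?').  Both are
     * treated as "character not in range" here. */
    if (count != 0)
        return -1;
    return (int)(outptr - out);
}
```

With this check, an iconv() that reports a hard error and one that
substitutes '?' and returns 1 both end up on the same failure path.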

Two more modifications:
[1] A negative value, such as ${(#X):--1}, is now an error.

[2] If __STDC_ISO_10646__ is not defined (for example on macOS) and a
UTF-8 locale is in use, then ucs4toutf8() is used for the conversion.
This function now accepts only the range 0 - 0x7fff_ffff, because
wctomb(3) on Linux (with a UTF-8 locale) accepts this range (the old
range of UCS4).
# But it seems UCS4 is now equivalent to UTF-32 and limited to the
# range 0 - 0x10_ffff (so the maximum length of UTF-8 is 4 bytes).
# We can make ucs4toutf8() accept only this range, if that's better.
# This would also make $'\U110000' an error.
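
To make the two candidate ranges in [2] concrete, here is a small
standalone sketch of a UTF-8 encoder with a configurable upper limit.
It is an illustration only, not the ucs4toutf8() from the patch:
utf8_encode(), its limit parameter and the two macros are hypothetical,
and it does not reject surrogates or check anything else.

```c
#include <assert.h>
#include <stddef.h>

#define UTF8_LIMIT_UTF32  0x10FFFFu    /* current UCS-4 / UTF-32 range */
#define UTF8_LIMIT_LEGACY 0x7FFFFFFFu  /* old 31-bit UCS-4 range */

/* Hypothetical encoder: stores the UTF-8 form of wval in dest (which
 * must have room for 6 bytes) and returns its length, or -1 if wval
 * is above the given limit or above the legacy 31-bit range. */
int
utf8_encode(char *dest, unsigned int wval, unsigned int limit)
{
    /* lead-byte markers indexed by sequence length */
    static const unsigned char lead[] =
        { 0x00, 0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC };
    int len, i;

    assert(dest != NULL);

    if (wval > limit || wval > UTF8_LIMIT_LEGACY)
        return -1;

    if (wval < 0x80)
        len = 1;
    else if (wval < 0x800)
        len = 2;
    else if (wval < 0x10000)
        len = 3;
    else if (wval < 0x200000)
        len = 4;
    else if (wval < 0x4000000)
        len = 5;
    else
        len = 6;

    /* fill continuation bytes from the end, 6 bits at a time */
    for (i = len - 1; i > 0; i--) {
        dest[i] = (char)((wval & 0x3f) | 0x80);
        wval >>= 6;
    }
    dest[0] = (char)(wval | lead[len]);
    return len;
}
```

With UTF8_LIMIT_LEGACY the value 0x110000 still encodes as a 4-byte
sequence; with UTF8_LIMIT_UTF32 it is rejected, which corresponds to
$'\U110000' becoming an error.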


BTW, with or without my recent patch, if the (X) flag is not
given and the conversion fails, then the lowest byte of the number
is output as a single-byte character. Is this really useful?
If so, do we need to document it? Or could we just output ""?
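
For comparison, the two fallback behaviors in question can be sketched
as below. This is only an illustration of the question, not code from
subst.c: both function names are hypothetical, and out is assumed to
have room for 2 bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical: what happens now when conversion fails without (X).
 * Only the lowest byte of the number survives. */
size_t
fallback_low_byte(long value, char *out)
{
    assert(out != NULL);
    out[0] = (char)(value & 0xff);
    out[1] = '\0';
    return 1;
}

/* Hypothetical alternative: output the empty string instead. */
size_t
fallback_empty(long value, char *out)
{
    (void)value;
    assert(out != NULL);
    out[0] = '\0';
    return 0;
}
```

For example, with the first variant the number 0x3042 ends up as the
single byte 0x42 ('B'), which has no obvious relation to the character
that was asked for.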


diff --git a/Src/subst.c b/Src/subst.c
index dc2052ee0..347b1b8bd 100644
--- a/Src/subst.c
+++ b/Src/subst.c
@@ -1501,16 +1501,15 @@ substevalchar(char *ptr)
 	return noerrs ? dupstring(""): NULL;
     }
     errflag |=3D saved_errflag;
+    if (ires < 0) {
+	zerr("character not in range");
+    }
 #ifdef MULTIBYTE_SUPPORT
-    if (isset(MULTIBYTE) && ires > 127) {
-	/* '\\' + 'U' + 8 bytes of character + '\0' */
-	char buf[11];
-
-	/* inefficient: should separate out \U handling from =
getkeystring */
-	sprintf(buf, "\\U%.8x", (unsigned int)ires & 0xFFFFFFFFu);
-	ptr =3D getkeystring(buf, &len, GETKEYS_BINDKEY, NULL);
+    else if (isset(MULTIBYTE) && ires > 127) {
+	ptr =3D zhalloc(MB_CUR_MAX);
+	len =3D ucs4tomb((unsigned int)ires & 0xffffffff, ptr);
     }
-    if (len =3D=3D 0)
+    if (len <=3D 0)
 #endif
     {
 	ptr =3D zhalloc(2);
diff --git a/Src/utils.c b/Src/utils.c
index 7040d0954..e8d2613b4 100644
--- a/Src/utils.c
+++ b/Src/utils.c
@@ -6671,12 +6671,15 @@ dquotedzputs(char const *s, FILE *stream)
=20
 # if defined(HAVE_NL_LANGINFO) && defined(CODESET) && =
!defined(__STDC_ISO_10646__)
 /* Convert a character from UCS4 encoding to UTF-8 */
-
-static size_t
+ =20
+static int
 ucs4toutf8(char *dest, unsigned int wval)
 {
-    size_t len;
+    int len;
=20
+    /* UCS4 is now equivalent to UTF-32 and limited to 0 - 0x10_FFFF.
+     * This function accepts 0 - 0x7FFF_FFFF (old range of UCS4) to be
+     * compatible with wctomb(3) (in UTF-8 locale) on Linux. */
     if (wval < 0x80)
       len =3D 1;
     else if (wval < 0x800)
@@ -6687,8 +6690,12 @@ ucs4toutf8(char *dest, unsigned int wval)
       len =3D 4;
     else if (wval < 0x4000000)
       len =3D 5;
-    else
+    else if (wval < 0x80000000)
       len =3D 6;
+    else {
+      zerr("character not in range");
+      return -1;
+    }
=20
     switch (len) { /* falls through except to the last case */
     case 6: dest[5] =3D (wval & 0x3f) | 0x80; wval >>=3D 6;
@@ -6705,30 +6712,89 @@ ucs4toutf8(char *dest, unsigned int wval)
 }
 #endif
=20
+/* Convert UCS4 to a multibyte character in current locale.
+ * Result is saved in buf (must be at least MB_CUR_MAX bytes long).
+ * Returns the number of bytes saved in buf, or -1 if conversion fails. =
*/
=20
-/*
- * The following only occurs once or twice in the code, but in =
different
- * places depending how character set conversion is implemented.
- */
-#define CHARSET_FAILED()		      \
-    if (how & GETKEY_DOLLAR_QUOTE) {	      \
-	while ((*tdest++ =3D *++s)) {	      \
-	    if (how & GETKEY_UPDATE_OFFSET) { \
-		if (s - sstart > *misc)	      \
-		    (*misc)++;		      \
-	    }				      \
-	    if (*s =3D=3D Snull) {		      \
-		*len =3D (s - sstart) + 1;      \
-		*tdest =3D '\0';		      \
-		return buf;		      \
-	    }				      \
-	}				      \
-	*len =3D tdest - buf;		      \
-	return buf;			      \
-    }					      \
-    *t =3D '\0';				      \
-    *len =3D t - buf;			      \
-    return buf
+/**/
+int
+ucs4tomb(unsigned int wval, char *buf)
+{
+#if defined(HAVE_WCHAR_H) && defined(HAVE_WCTOMB) && =
defined(__STDC_ISO_10646__)
+    int count =3D wctomb(buf, (wchar_t)wval);
+    if (count =3D=3D -1)
+	zerr("character not in range");
+    return count;
+#else	/* !(HAVE_WCHAR_H && HAVE_WCTOMB && __STDC_ISO_10646__) */
+# if defined(HAVE_NL_LANGINFO) && defined(CODESET)
+    if (!strcmp(nl_langinfo(CODESET), "UTF-8")) {
+	return ucs4toutf8(buf, wval);
+    } else {
+#   ifdef HAVE_ICONV
+	iconv_t cd;
+	char inbuf[4], *bsave =3D buf;
+	ICONV_CONST char *inptr =3D inbuf;
+	size_t inbytes =3D 4, outbytes =3D 6;
+	const char *codesetstr =3D nl_langinfo(CODESET);
+	size_t count;
+	int i;
+
+	/*
+	 * If the code set isn't handled, we'd better assume it's =
US-ASCII
+	 * rather than just failing hopelessly.  Solaris has a weird =
habit
+	 * of returning 646.  This is handled by the native iconv(), but
+	 * not by GNU iconv; what's more, some versions of the native =
iconv
+	 * don't handle standard names like ASCII.
+	 *
+	 * This should only be a problem if there's a mismatch between =
the
+	 * NLS and the iconv in use, which probably only means if =
libiconv
+	 * is in use.  We checked at configure time if our libraries =
pulled
+	 * in _libiconv_version, which should be a good test.
+	 *
+	 * It shouldn't ever be NULL, but while we're being paranoid...
+	 */
+#     ifdef ICONV_FROM_LIBICONV
+	if (!codesetstr || !*codesetstr)
+	    codesetstr =3D "US-ASCII";
+#     endif
+	cd =3D iconv_open(codesetstr, "UCS-4BE");
+#     ifdef ICONV_FROM_LIBICONV
+	if (cd =3D=3D (iconv_t)-1 &&  !strcmp(codesetstr, "646")) {
+	    codesetstr =3D "US-ASCII";
+	    cd =3D iconv_open(codesetstr, "UCS-4BE");
+	}
+#     endif
+	if (cd =3D=3D (iconv_t)-1) {
+	    zerr("cannot do charset conversion (iconv failed)");
+	    return -1;
+	}
+
+	/* store value in big endian form */
+	for (i=3D3; i>=3D0; i--) {
+	    inbuf[i] =3D wval & 0xff;
+	    wval >>=3D 8;
+	}
+	count =3D iconv(cd, &inptr, &inbytes, &buf, &outbytes);
+	iconv_close(cd);
+	if (count) {
+	    /* -1 indicates error. Positive value means number of =
"invalid"
+	     * (or "non-reversible") conversions, which we consider as
+	     * "out-of-range" characters. */
+	    zerr("character not in range");
+	    return -1;
+	}
+	return buf - bsave;
+#   else    /* !HAVE_ICONV */
+	zerr("cannot do charset conversion (iconv not available)");
+	return -1;
+#   endif   /* HAVE_ICONV */
+    }
+# else	/* !(HAVE_NL_LANGINFO && CODESET) */
+    zerr("cannot do charset conversion (NLS not supported)");
+    return -1;
+# endif	/* HAVE_NL_LANGINFO && CODESET */
+#endif	/* HAVE_WCHAR_H && HAVE_WCTOMB && __STDC_ISO_10646__ */
+}
=20
 /*
  * Decode a key string, turning it into the literal characters.
@@ -6785,21 +6851,6 @@ getkeystring(char *s, int *len, int how, int =
*misc)
     char *t, *tdest =3D NULL, *u =3D NULL, *sstart =3D s, *tbuf =3D =
NULL;
     char svchar =3D '\0';
     int meta =3D 0, control =3D 0, ignoring =3D 0;
-    int i;
-#if defined(HAVE_WCHAR_H) && defined(HAVE_WCTOMB) && =
defined(__STDC_ISO_10646__)
-    wint_t wval;
-    int count;
-#else
-    unsigned int wval;
-# if defined(HAVE_NL_LANGINFO) && defined(CODESET)
-#  if defined(HAVE_ICONV)
-    iconv_t cd;
-    char inbuf[4];
-    size_t inbytes, outbytes;
-#  endif
-    size_t count;
-# endif
-#endif
=20
     DPUTS((how & GETKEY_UPDATE_OFFSET) &&
 	  (how & ~(GETKEYS_DOLLARS_QUOTE|GETKEY_UPDATE_OFFSET)),
@@ -6864,7 +6915,8 @@ getkeystring(char *s, int *len, int how, int =
*misc)
     }
     for (; *s; s++) {
 	if (*s =3D=3D '\\' && s[1]) {
-	    int miscadded;
+	    int miscadded, count, i;
+	    unsigned int wval;
 	    if ((how & GETKEY_UPDATE_OFFSET) && s - sstart < *misc) {
 		(*misc)--;
 		miscadded =3D 1;
@@ -6979,86 +7031,32 @@ getkeystring(char *s, int *len, int how, int =
*misc)
 		    *misc =3D wval;
 		    return s+1;
 		}
-#if defined(HAVE_WCHAR_H) && defined(HAVE_WCTOMB) && =
defined(__STDC_ISO_10646__)
-		count =3D wctomb(t, (wchar_t)wval);
+		count =3D ucs4tomb(wval, t);
 		if (count =3D=3D -1) {
-		    zerr("character not in range");
-		    CHARSET_FAILED();
+		    if (how & GETKEY_DOLLAR_QUOTE) {
+			while ((*tdest++ =3D *++s)) {
+			    if (how & GETKEY_UPDATE_OFFSET) {
+				if (s - sstart > *misc)
+				    (*misc)++;
+			    }
+			    if (*s =3D=3D Snull) {
+				*len =3D (s - sstart) + 1;
+				*tdest =3D '\0';
+				return buf;
+			    }
+			}
+			*len =3D tdest - buf;
+		    }
+		    else {
+			*t =3D '\0';
+			*len =3D t - buf;
+		    }
+		    return buf;
 		}
 		if ((how & GETKEY_UPDATE_OFFSET) && s - sstart < *misc)
 		    (*misc) +=3D count;
 		t +=3D count;
-# else
-#  if defined(HAVE_NL_LANGINFO) && defined(CODESET)
-		if (!strcmp(nl_langinfo(CODESET), "UTF-8")) {
-		    count =3D ucs4toutf8(t, wval);
-		    t +=3D count;
-		    if ((how & GETKEY_UPDATE_OFFSET) && s - sstart < =
*misc)
-			(*misc) +=3D count;
-		} else {
-#   ifdef HAVE_ICONV
-		    ICONV_CONST char *inptr =3D inbuf;
-		    const char *codesetstr =3D nl_langinfo(CODESET);
-    	    	    inbytes =3D 4;
-		    outbytes =3D 6;
-		    /* store value in big endian form */
-		    for (i=3D3;i>=3D0;i--) {
-			inbuf[i] =3D wval & 0xff;
-			wval >>=3D 8;
-		    }
=20
-		    /*
-		     * If the code set isn't handled, we'd better
-		     * assume it's US-ASCII rather than just failing
-		     * hopelessly.  Solaris has a weird habit of
-		     * returning 646.  This is handled by the
-		     * native iconv(), but not by GNU iconv; what's
-		     * more, some versions of the native iconv don't
-		     * handle standard names like ASCII.
-		     *
-		     * This should only be a problem if there's a
-		     * mismatch between the NLS and the iconv in use,
-		     * which probably only means if libiconv is in use.
-		     * We checked at configure time if our libraries
-		     * pulled in _libiconv_version, which should be
-		     * a good test.
-		     *
-		     * It shouldn't ever be NULL, but while we're
-		     * being paranoid...
-		     */
-#ifdef ICONV_FROM_LIBICONV
-		    if (!codesetstr || !*codesetstr)
-			codesetstr =3D "US-ASCII";
-#endif
-    	    	    cd =3D iconv_open(codesetstr, "UCS-4BE");
-#ifdef ICONV_FROM_LIBICONV
-		    if (cd =3D=3D (iconv_t)-1 &&  !strcmp(codesetstr, =
"646")) {
-			codesetstr =3D "US-ASCII";
-			cd =3D iconv_open(codesetstr, "UCS-4BE");
-		    }
-#endif
-		    if (cd =3D=3D (iconv_t)-1) {
-			zerr("cannot do charset conversion (iconv =
failed)");
-			CHARSET_FAILED();
-		    }
-                    count =3D iconv(cd, &inptr, &inbytes, &t, =
&outbytes);
-		    iconv_close(cd);
-		    if (count =3D=3D (size_t)-1) {
-                        zerr("character not in range");
-			CHARSET_FAILED();
-		    }
-		    if ((how & GETKEY_UPDATE_OFFSET) && s - sstart < =
*misc)
-			(*misc) +=3D count;
-#   else
-                    zerr("cannot do charset conversion (iconv not =
available)");
-		    CHARSET_FAILED();
-#   endif
-		}
-#  else
-                zerr("cannot do charset conversion (NLS not =
supported)");
-		CHARSET_FAILED();
-#  endif
-# endif
 		if (how & GETKEY_DOLLAR_QUOTE) {
 		    char *t2;
 		    for (t2 =3D tbuf; t2 < t; t2++) {