From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jun T
Date: Thu, 21 Sep 2023 18:26:31 +0900
Subject: Re: (LC_ALL=C; set -x 128 129; printf "%s\n" ${(#)@} | hexdump -C)
To: zsh-workers@zsh.org
References: <20230830072753.hhveg7teosubwzq7@chazelas.org> <88812889-04BC-412A-85BE-BDAA2326B29B@kba.biglobe.ne.jp> <899459233.232418.1694104433053@mail.virginmedia.com> <64346084-434A-4A42-AD56-44809DA2E54C@kba.biglobe.ne.jp> <968697743.3642134.1694422642580@mail.virginmedia.com>
Content-Type: text/plain; charset=us-ascii
X-Seq: 52169

> 2023/09/13 18:59, Jun T wrote:
>
> the test fails on FreeBSD,
> DragonFly and NetBSD for out-of-range characters.
(snip)
> This is due to the peculiar behavior of iconv(3). It converts
> out-of-range character to '?' (0x3f) with return value 1,

This behavior of iconv(3) is explicitly documented in the manpage,
so we can't call it a bug, but I think we should treat a positive
return value of iconv() in the same way as -1.

But simply replacing (utils.c:7046)
    if (count == (size_t)-1) {
by
    if (count) {
didn't work because of the complication due to errflag/noerrs.
So I moved the conversion code into a new function ucs4tomb().

Two more modifications:

[1] A negative value, such as ${(#X):--1}, is now an error.

[2] If __STDC_ISO_10646__ is not defined, for example on macOS,
and a UTF-8 locale is in use, then ucs4toutf8() is used for the
conversion. This function now accepts only the range 0 - 0x7fff_ffff
because wctomb(3) on Linux (with a UTF-8 locale) accepts this range
(the old range of UCS4).
# But now it seems UCS4 is equivalent to UTF-32 and limited to the
# range 0 - 0x10_ffff (and the maximum length of UTF-8 is 4 bytes).
# We can make ucs4toutf8() accept only this range, if that's better.
# This will also make $'\U110000' an error.

BTW, with or without my recent patch, if the (X) flag is not given
but the conversion fails, then the lowest byte of the number is output
as a single-byte character. Is this really useful? If so, do we need
to document it? Or could we just output ""?


diff --git a/Src/subst.c b/Src/subst.c
index dc2052ee0..347b1b8bd 100644
--- a/Src/subst.c
+++ b/Src/subst.c
@@ -1501,16 +1501,15 @@ substevalchar(char *ptr)
         return noerrs ? dupstring(""): NULL;
     }
     errflag |= saved_errflag;
+    if (ires < 0) {
+        zerr("character not in range");
+    }
 #ifdef MULTIBYTE_SUPPORT
-    if (isset(MULTIBYTE) && ires > 127) {
-        /* '\\' + 'U' + 8 bytes of character + '\0' */
-        char buf[11];
-
-        /* inefficient: should separate out \U handling from getkeystring */
-        sprintf(buf, "\\U%.8x", (unsigned int)ires & 0xFFFFFFFFu);
-        ptr = getkeystring(buf, &len, GETKEYS_BINDKEY, NULL);
+    else if (isset(MULTIBYTE) && ires > 127) {
+        ptr = zhalloc(MB_CUR_MAX);
+        len = ucs4tomb((unsigned int)ires & 0xffffffff, ptr);
     }
-    if (len == 0)
+    if (len <= 0)
 #endif
     {
         ptr = zhalloc(2);
diff --git a/Src/utils.c b/Src/utils.c
index 7040d0954..e8d2613b4 100644
--- a/Src/utils.c
+++ b/Src/utils.c
@@ -6671,12 +6671,15 @@ dquotedzputs(char const *s, FILE *stream)
 
 # if defined(HAVE_NL_LANGINFO) && defined(CODESET) && !defined(__STDC_ISO_10646__)
 /* Convert a character from UCS4 encoding to UTF-8 */
-
-static size_t
+
+static int
 ucs4toutf8(char *dest, unsigned int wval)
 {
-    size_t len;
+    int len;
 
+    /* UCS4 is now equivalent to UTF-32 and limited to 0 - 0x10_FFFF.
+     * This function accepts 0 - 0x7FFF_FFFF (old range of UCS4) to be
+     * compatible with wctomb(3) (in UTF-8 locale) on Linux. */
     if (wval < 0x80)
         len = 1;
     else if (wval < 0x800)
@@ -6687,8 +6690,12 @@ ucs4toutf8(char *dest, unsigned int wval)
         len = 4;
     else if (wval < 0x4000000)
         len = 5;
-    else
+    else if (wval < 0x80000000)
         len = 6;
+    else {
+        zerr("character not in range");
+        return -1;
+    }
 
     switch (len) { /* falls through except to the last case */
     case 6: dest[5] = (wval & 0x3f) | 0x80; wval >>= 6;
@@ -6705,30 +6712,89 @@ ucs4toutf8(char *dest, unsigned int wval)
 }
 #endif
 
+/* Convert UCS4 to a multibyte character in current locale.
+ * Result is saved in buf (must be at least MB_CUR_MAX bytes long).
+ * Returns the number of bytes saved in buf, or -1 if conversion fails. */
 
-/*
- * The following only occurs once or twice in the code, but in different
- * places depending how character set conversion is implemented.
- */
-#define CHARSET_FAILED() \
-    if (how & GETKEY_DOLLAR_QUOTE) { \
-        while ((*tdest++ = *++s)) { \
-            if (how & GETKEY_UPDATE_OFFSET) { \
-                if (s - sstart > *misc) \
-                    (*misc)++; \
-            } \
-            if (*s == Snull) { \
-                *len = (s - sstart) + 1; \
-                *tdest = '\0'; \
-                return buf; \
-            } \
-        } \
-        *len = tdest - buf; \
-        return buf; \
-    } \
-    *t = '\0'; \
-    *len = t - buf; \
-    return buf
+/**/
+int
+ucs4tomb(unsigned int wval, char *buf)
+{
+#if defined(HAVE_WCHAR_H) && defined(HAVE_WCTOMB) && defined(__STDC_ISO_10646__)
+    int count = wctomb(buf, (wchar_t)wval);
+    if (count == -1)
+        zerr("character not in range");
+    return count;
+#else /* !(HAVE_WCHAR_H && HAVE_WCTOMB && __STDC_ISO_10646__) */
+# if defined(HAVE_NL_LANGINFO) && defined(CODESET)
+    if (!strcmp(nl_langinfo(CODESET), "UTF-8")) {
+        return ucs4toutf8(buf, wval);
+    } else {
+# ifdef HAVE_ICONV
+        iconv_t cd;
+        char inbuf[4], *bsave = buf;
+        ICONV_CONST char *inptr = inbuf;
+        size_t inbytes = 4, outbytes = 6;
+        const char *codesetstr = nl_langinfo(CODESET);
+        size_t count;
+        int i;
+
+        /*
+         * If the code set isn't handled, we'd better assume it's US-ASCII
+         * rather than just failing hopelessly. Solaris has a weird habit
+         * of returning 646. This is handled by the native iconv(), but
+         * not by GNU iconv; what's more, some versions of the native iconv
+         * don't handle standard names like ASCII.
+         *
+         * This should only be a problem if there's a mismatch between the
+         * NLS and the iconv in use, which probably only means if libiconv
+         * is in use. We checked at configure time if our libraries pulled
+         * in _libiconv_version, which should be a good test.
+         *
+         * It shouldn't ever be NULL, but while we're being paranoid...
+         */
+# ifdef ICONV_FROM_LIBICONV
+        if (!codesetstr || !*codesetstr)
+            codesetstr = "US-ASCII";
+# endif
+        cd = iconv_open(codesetstr, "UCS-4BE");
+# ifdef ICONV_FROM_LIBICONV
+        if (cd == (iconv_t)-1 && !strcmp(codesetstr, "646")) {
+            codesetstr = "US-ASCII";
+            cd = iconv_open(codesetstr, "UCS-4BE");
+        }
+# endif
+        if (cd == (iconv_t)-1) {
+            zerr("cannot do charset conversion (iconv failed)");
+            return -1;
+        }
+
+        /* store value in big endian form */
+        for (i=3; i>=0; i--) {
+            inbuf[i] = wval & 0xff;
+            wval >>= 8;
+        }
+        count = iconv(cd, &inptr, &inbytes, &buf, &outbytes);
+        iconv_close(cd);
+        if (count) {
+            /* -1 indicates error. Positive value means number of "invalid"
+             * (or "non-reversible") conversions, which we consider as
+             * "out-of-range" characters. */
+            zerr("character not in range");
+            return -1;
+        }
+        return buf - bsave;
+# else /* !HAVE_ICONV */
+        zerr("cannot do charset conversion (iconv not available)");
+        return -1;
+# endif /* HAVE_ICONV */
+    }
+# else /* !(HAVE_NL_LANGINFO && CODESET) */
+    zerr("cannot do charset conversion (NLS not supported)");
+    return -1;
+# endif /* HAVE_NL_LANGINFO && CODESET */
+#endif /* HAVE_WCHAR_H && HAVE_WCTOMB && __STDC_ISO_10646__ */
+}
 
 /*
  * Decode a key string, turning it into the literal characters.
@@ -6785,21 +6851,6 @@ getkeystring(char *s, int *len, int how, int *misc)
     char *t, *tdest = NULL, *u = NULL, *sstart = s, *tbuf = NULL;
     char svchar = '\0';
     int meta = 0, control = 0, ignoring = 0;
-    int i;
-#if defined(HAVE_WCHAR_H) && defined(HAVE_WCTOMB) && defined(__STDC_ISO_10646__)
-    wint_t wval;
-    int count;
-#else
-    unsigned int wval;
-# if defined(HAVE_NL_LANGINFO) && defined(CODESET)
-# if defined(HAVE_ICONV)
-    iconv_t cd;
-    char inbuf[4];
-    size_t inbytes, outbytes;
-# endif
-    size_t count;
-# endif
-#endif
 
     DPUTS((how & GETKEY_UPDATE_OFFSET) &&
           (how & ~(GETKEYS_DOLLARS_QUOTE|GETKEY_UPDATE_OFFSET)),
@@ -6864,7 +6915,8 @@ getkeystring(char *s, int *len, int how, int *misc)
     }
     for (; *s; s++) {
         if (*s == '\\' && s[1]) {
-            int miscadded;
+            int miscadded, count, i;
+            unsigned int wval;
             if ((how & GETKEY_UPDATE_OFFSET) && s - sstart < *misc) {
                 (*misc)--;
                 miscadded = 1;
@@ -6979,86 +7031,32 @@ getkeystring(char *s, int *len, int how, int *misc)
                     *misc = wval;
                     return s+1;
                 }
-#if defined(HAVE_WCHAR_H) && defined(HAVE_WCTOMB) && defined(__STDC_ISO_10646__)
-                count = wctomb(t, (wchar_t)wval);
+                count = ucs4tomb(wval, t);
                 if (count == -1) {
-                    zerr("character not in range");
-                    CHARSET_FAILED();
+                    if (how & GETKEY_DOLLAR_QUOTE) {
+                        while ((*tdest++ = *++s)) {
+                            if (how & GETKEY_UPDATE_OFFSET) {
+                                if (s - sstart > *misc)
+                                    (*misc)++;
+                            }
+                            if (*s == Snull) {
+                                *len = (s - sstart) + 1;
+                                *tdest = '\0';
+                                return buf;
+                            }
+                        }
+                        *len = tdest - buf;
+                    }
+                    else {
+                        *t = '\0';
+                        *len = t - buf;
+                    }
+                    return buf;
                 }
                 if ((how & GETKEY_UPDATE_OFFSET) && s - sstart < *misc)
                     (*misc) += count;
                 t += count;
-# else
-# if defined(HAVE_NL_LANGINFO) && defined(CODESET)
-                if (!strcmp(nl_langinfo(CODESET), "UTF-8")) {
-                    count = ucs4toutf8(t, wval);
-                    t += count;
-                    if ((how & GETKEY_UPDATE_OFFSET) && s - sstart < *misc)
-                        (*misc) += count;
-                } else {
-# ifdef HAVE_ICONV
-                    ICONV_CONST char *inptr = inbuf;
-                    const char *codesetstr = nl_langinfo(CODESET);
-                    inbytes = 4;
-                    outbytes = 6;
-                    /* store value in big endian form */
-                    for (i=3;i>=0;i--) {
-                        inbuf[i] = wval & 0xff;
-                        wval >>= 8;
-                    }
 
-                    /*
-                     * If the code set isn't handled, we'd better
-                     * assume it's US-ASCII rather than just failing
-                     * hopelessly. Solaris has a weird habit of
-                     * returning 646. This is handled by the
-                     * native iconv(), but not by GNU iconv; what's
-                     * more, some versions of the native iconv don't
-                     * handle standard names like ASCII.
-                     *
-                     * This should only be a problem if there's a
-                     * mismatch between the NLS and the iconv in use,
-                     * which probably only means if libiconv is in use.
-                     * We checked at configure time if our libraries
-                     * pulled in _libiconv_version, which should be
-                     * a good test.
-                     *
-                     * It shouldn't ever be NULL, but while we're
-                     * being paranoid...
-                     */
-#ifdef ICONV_FROM_LIBICONV
-                    if (!codesetstr || !*codesetstr)
-                        codesetstr = "US-ASCII";
-#endif
-                    cd = iconv_open(codesetstr, "UCS-4BE");
-#ifdef ICONV_FROM_LIBICONV
-                    if (cd == (iconv_t)-1 && !strcmp(codesetstr, "646")) {
-                        codesetstr = "US-ASCII";
-                        cd = iconv_open(codesetstr, "UCS-4BE");
-                    }
-#endif
-                    if (cd == (iconv_t)-1) {
-                        zerr("cannot do charset conversion (iconv failed)");
-                        CHARSET_FAILED();
-                    }
-                    count = iconv(cd, &inptr, &inbytes, &t, &outbytes);
-                    iconv_close(cd);
-                    if (count == (size_t)-1) {
-                        zerr("character not in range");
-                        CHARSET_FAILED();
-                    }
-                    if ((how & GETKEY_UPDATE_OFFSET) && s - sstart < *misc)
-                        (*misc) += count;
-# else
-                    zerr("cannot do charset conversion (iconv not available)");
-                    CHARSET_FAILED();
-# endif
-                }
-# else
-                zerr("cannot do charset conversion (NLS not supported)");
-                CHARSET_FAILED();
-# endif
-# endif
                 if (how & GETKEY_DOLLAR_QUOTE) {
                     char *t2;
                     for (t2 = tbuf; t2 < t; t2++) {
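
For illustration only (this is not part of the patch): a minimal,
self-contained sketch of the range check that ucs4toutf8() performs in
the hunks above. The name encode_ucs4_utf8() and the lead[] table are
hypothetical and do not exist in zsh. It assumes a 32-bit unsigned int
and uses the old UTF-8 scheme of up to 6 bytes, so it accepts
0 - 0x7fff_ffff and rejects anything larger, without calling zerr().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of zsh: encode a UCS-4 value as UTF-8
 * using the original scheme of up to 6 bytes per character.
 * dest must have room for at least 6 bytes.
 * Returns the number of bytes written, or -1 if wval > 0x7FFFFFFF. */
static int
encode_ucs4_utf8(unsigned int wval, char *dest)
{
    /* lead-byte prefix for each sequence length (index 0 unused) */
    static const unsigned char lead[7] =
        { 0, 0x00, 0xc0, 0xe0, 0xf0, 0xf8, 0xfc };
    int len, i;

    assert(dest != NULL);

    if (wval < 0x80)
        len = 1;
    else if (wval < 0x800)
        len = 2;
    else if (wval < 0x10000)
        len = 3;
    else if (wval < 0x200000)
        len = 4;
    else if (wval < 0x4000000)
        len = 5;
    else if (wval < 0x80000000u)
        len = 6;
    else
        return -1;              /* out of range: report failure */

    /* fill the continuation bytes from the end, 6 bits at a time */
    for (i = len - 1; i > 0; i--) {
        dest[i] = (char)((wval & 0x3f) | 0x80);
        wval >>= 6;
    }
    /* whatever bits remain go into the lead byte */
    dest[0] = (char)(wval | lead[len]);
    return len;
}
```

Restricting the function to the current UTF-32 range, as suggested in
the "#" remarks above, would amount to replacing the last three length
tests with a single "wval < 0x110000" test that yields len = 4.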