zsh-workers
* (LC_ALL=C; set -x 128 129; printf "%s\n" ${(#)@} | hexdump -C)
@ 2023-08-30  7:27 Stephane Chazelas
  2023-08-30 19:45 ` Bart Schaefer
  0 siblings, 1 reply; 10+ messages in thread
From: Stephane Chazelas @ 2023-08-30  7:27 UTC (permalink / raw)
  To: Zsh hackers list

Something very wrong seems to be happening here, not sure what or how that
could happen:

$ zsh -c 'LC_ALL=C; set -x 128 129; printf "%s\n" ${(#)@} | hexdump -C'
+zsh:1> printf '"%s\n"' ''
+zsh:1> hexdump -C
00000000  22 0a 22                                          |"."|
00000003

(with git HEAD as well) on Ubuntu 22.04 amd64.

Does not happen without LC_ALL=C or with just "set -x 128" or with values <
128, or with printf "%s\n" $'\200' $'\201'

-- 
Stephane



* Re: (LC_ALL=C; set -x 128 129; printf "%s\n" ${(#)@} | hexdump -C)
  2023-08-30  7:27 (LC_ALL=C; set -x 128 129; printf "%s\n" ${(#)@} | hexdump -C) Stephane Chazelas
@ 2023-08-30 19:45 ` Bart Schaefer
  2023-08-31 17:38   ` Jun. T
  0 siblings, 1 reply; 10+ messages in thread
From: Bart Schaefer @ 2023-08-30 19:45 UTC (permalink / raw)
  To: Zsh hackers list

On Wed, Aug 30, 2023 at 12:28 AM Stephane Chazelas
<stephane@chazelas.org> wrote:
>
> $ zsh -c 'LC_ALL=C; set -x 128 129; printf "%s\n" ${(#)@} | hexdump -C'
> +zsh:1> printf '"%s\n"' ''
> +zsh:1> hexdump -C
> 00000000  22 0a 22                                          |"."|
> 00000003

This doesn't happen if you add +o multibyte:

+zsh:1> printf '%s\n' $'\M-\C-@' $'\M-\C-A'
+zsh:1> hexdump -C
00000000  80 0a 81 0a                                       |....|
00000004

Doc for (#) says:

     If the MULTIBYTE option is set and the number is greater than 127
     (i.e.  not an ASCII character) it is treated as a Unicode
     character.

> Does not happen without LC_ALL=C or with just "set -x 128" or with values <
> 128, or with printf "%s\n" $'\200' $'\201'

So $'\200' isn't subject to unicode interpretation where ${(#):-128}
is.  But this also doesn't happen with 2 separate arguments
${(#):-128} ${(#):-129}, only with ${(#)@}.  So something about the
implementation of $@ is mucking up the unicode translation, possibly
by leaving it thinking it's in the middle of an incomplete byte
sequence?  It's also different with double quotes around $@:

% zsh -fc 'LC_ALL=C; set -x 128 130; printf "%s\n" "${(#)@}" | hexdump -C'
+zsh:1> printf '"%s\n"' '"'
+zsh:1> hexdump -C
00000000  22 22 0a 22                                       |""."|
00000004



* Re: (LC_ALL=C; set -x 128 129; printf "%s\n" ${(#)@} | hexdump -C)
  2023-08-30 19:45 ` Bart Schaefer
@ 2023-08-31 17:38   ` Jun. T
  2023-09-07 14:26     ` Jun. T
  0 siblings, 1 reply; 10+ messages in thread
From: Jun. T @ 2023-08-31 17:38 UTC (permalink / raw)
  To: zsh-workers


> 2023/08/31 4:45, Bart Schaefer <schaefer@brasslantern.com> wrote:
> 
> Doc for (#) says:
> 
>     If the MULTIBYTE option is set and the number is greater than 127
>     (i.e.  not an ASCII character) it is treated as a Unicode
>     character.

It seems that this translation succeeds only if a UTF-8 locale is in use.
The translation is done in the function substevalchar(), but in the C locale it
fails to convert and falls back to 'no translation':

$ LC_ALL=C zsh -f
% n=128
% printf "%s\n" ${(#)n} | hexdump -C
00000000  80 0a                                             |..|
00000002

substevalchar() calls getkeystring() for the conversion, and
wctomb() (utils.c:6983) fails (returns -1), and zerr() sets
errflag.
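
The wctomb() failure can be tried outside zsh. What follows is only a minimal sketch, assuming a platform where wchar_t holds Unicode code points (__STDC_ISO_10646__). The helper name try_wctomb is hypothetical and is not a zsh function:

```c
#include <limits.h>
#include <locale.h>
#include <stdlib.h>

/* Hypothetical helper (not part of zsh): switch LC_ALL to `locale` and
 * try to convert the code point `wval` to a multibyte sequence in `buf`.
 * `buf` must hold at least MB_LEN_MAX bytes.
 * Returns the byte count reported by wctomb(), or -1 if the locale
 * cannot be set or the character is not representable in it. */
int
try_wctomb(const char *locale, unsigned int wval, char *buf)
{
    if (!setlocale(LC_ALL, locale))
        return -1;
    wctomb(NULL, 0);            /* reset any shift state */
    return wctomb(buf, (wchar_t)wval);
}
```

try_wctomb("C", 'A', buf) returns 1. The call corresponding to this report is try_wctomb("C", 128, buf), which is the one that returns -1 on the Linux systems discussed here; other C libraries may behave differently.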

If we use an array instead of a scalar:

% a=( 128 129 )
% printf "%s\n" ${(#)a} | hexdump -C
00000000  22 0a 22                                          |"."|
00000003

In this case, errflag is set when translating the first element (128), but it
is not reset and is still set to 1 when substevalchar() is called for the
second element (129).
Then substevalchar() immediately returns NULL (subst.c:1496).

Now paramsubst() sets haserr=1 (subst.c:3592) and returns NULL at
line 3609 (although errflag is reset at line 3606),
and stringsubst() also returns NULL (line 327),
and prefork() quits at line 146 (probably before removing the " " from
"%s\n").

So we need to reset (or save/restore) errflag somewhere...
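
The way a leftover error flag makes the second element fail can be modelled in a few lines. This is a toy sketch only: toy_errflag, toy_evalchar and toy_convert_all are hypothetical names that mimic, but are not, the zsh internals, and "value above 127" stands in for the failed conversion in the C locale:

```c
#include <stddef.h>
#include <stdlib.h>

/* Toy model only: stand-in for the global errflag. */
static int toy_errflag;

/* Stand-in for substevalchar(): gives up at once if the flag is already
 * set; raises the flag for values above 127 (modelling the failed
 * conversion) but still returns the single-byte fallback. */
static const char *
toy_evalchar(const char *expr, char out[2])
{
    long n;

    if (toy_errflag)
        return NULL;
    n = strtol(expr, NULL, 0);
    if (n > 127)
        toy_errflag = 1;
    out[0] = (char)(n & 0xff);
    out[1] = '\0';
    return out;
}

/* Convert every element in a loop without ever clearing the flag.
 * Returns the number of elements converted before the first NULL. */
int
toy_convert_all(const char *const *elems, size_t n)
{
    char out[2];
    size_t i;

    toy_errflag = 0;
    for (i = 0; i < n; i++)
        if (!toy_evalchar(elems[i], out))
            break;
    return (int)i;
}
```

With the single element "128" the loop converts one element and finishes normally. With "128" and "129" it also stops after one: the second call sees the flag left behind by the first and returns NULL, which matches the scalar case working and the array case failing.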

And, with multibyte option on but with C locale, what is the
correct behavior of ${(#)n} for n > 127 ?




* Re: (LC_ALL=C; set -x 128 129; printf "%s\n" ${(#)@} | hexdump -C)
  2023-08-31 17:38   ` Jun. T
@ 2023-09-07 14:26     ` Jun. T
  2023-09-07 16:33       ` Peter Stephenson
  0 siblings, 1 reply; 10+ messages in thread
From: Jun. T @ 2023-09-07 14:26 UTC (permalink / raw)
  To: zsh-workers


> 2023/09/01 2:38, I wrote:
> 
> So we need to reset (or save/restore) errflag somewhere...

The patch below is a simple save/restore errflag. It seems to work,
but I'm not sure it is the (or a) correct fix. I will not push this
unless I get positive responses.


By the way, the problem is not limited to the C locale. Even with
a UTF-8 locale (tested on Linux; macOS may behave differently):

% a=( 0x8000_0000 0x8000_0001 )
% printf "%s\n" ${(#)a} | hexdump -C
00000000  22 0a 22                                          |"."|
00000003


Another problem (but I feel we don't need to fix it) is that it seems
rather arbitrary when we get an error with the (X) flag.
For positive values the lowest 4 bytes are used (subst.c:1503),
while for negative values only the lowest byte is used
(subst.c:1511) and we never get a warning (tested on Fedora 38):

% printf "%s\n" ${(#X):-0x8000_0000}
zsh: character not in range
% printf "%s\n" ${(#X):-0x1_0000_0000} | hexdump -C
00000000  00 0a                                             |..|
00000002
% printf "%s\n" ${(#X):-0x1_8000_0000} | hexdump -C
zsh: character not in range
% printf "%s\n" ${(#X):--1} | hexdump -C
00000000  ff 0a                                             |..|
00000002
% printf "%s\n" ${(#X):--0x8000_0001} | hexdump -C
00000000  ff 0a                                             |..|
00000002
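
The truncation behind these results can be checked with plain integer arithmetic. A small sketch, assuming the evaluated number is a 64-bit signed integer; low32 and low8 are hypothetical helper names, not zsh functions:

```c
#include <stdint.h>

/* What the positive branch keeps: only the lowest 32 bits of the
 * evaluated number are passed on as the code point. */
uint32_t
low32(int64_t ires)
{
    return (uint32_t)((uint64_t)ires & 0xffffffffu);
}

/* What the single-byte path keeps: only the lowest 8 bits. */
unsigned char
low8(int64_t ires)
{
    return (unsigned char)((uint64_t)ires & 0xffu);
}
```

low32(0x100000000) is 0, hence the NUL byte, while low32(0x180000000) is 0x80000000, which is rejected as out of range. low8(-1) and low8(-0x80000001) are both 0xff, which is why the two negative examples print the same byte.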



diff --git a/Src/subst.c b/Src/subst.c
index 14947ae36..6d9197d44 100644
--- a/Src/subst.c
+++ b/Src/subst.c
@@ -3572,7 +3572,7 @@ colonsubscript:
     if (errflag)
 	return NULL;
     if (evalchar) {
-	int one = noerrs, oef = errflag, haserr = 0;
+	int one = noerrs, haserr = 0;
 
 	if (!quoteerr)
 	    noerrs = 1;
@@ -3582,28 +3582,33 @@ colonsubscript:
 	 */
 	if (isarr) {
 	    char **aval2, **avptr, **av2ptr;
+	    int tmp_errflag = 0; /* errflag==0 at this point */
 
 	    aval2 = (char **)zhalloc((arrlen(aval)+1)*sizeof(char *));
 
 	    for (avptr = aval, av2ptr = aval2; *avptr; avptr++, av2ptr++)
 	    {
-		/* When noerrs = 1, the only error is out-of-memory */
-		if (!(*av2ptr = substevalchar(*avptr))) {
+		/* errflag must be cleared when calling substevalchar().
+		 * It will set errflag if conversion fails. */
+		errflag = 0;
+		*av2ptr = substevalchar(*avptr);
+		tmp_errflag |= errflag;
+		if (!*av2ptr) { /* not a valid numerical expression? */
 		    haserr = 1;
 		    break;
 		}
 	    }
+	    errflag = tmp_errflag;
 	    *av2ptr = NULL;
 	    aval = aval2;
 	} else {
-	    /* When noerrs = 1, the only error is out-of-memory */
 	    if (!(val = substevalchar(val)))
 		haserr = 1;
 	}
 	noerrs = one;
 	if (!quoteerr) {
 	    /* Retain user interrupt error status */
-	    errflag = oef | (errflag & ERRFLAG_INT);
+	    errflag &= ERRFLAG_INT;
 	}
 	if (haserr || errflag)
 	    return NULL;





* Re: (LC_ALL=C; set -x 128 129; printf "%s\n" ${(#)@} | hexdump -C)
  2023-09-07 14:26     ` Jun. T
@ 2023-09-07 16:33       ` Peter Stephenson
  2023-09-08 16:30         ` Jun. T
  0 siblings, 1 reply; 10+ messages in thread
From: Peter Stephenson @ 2023-09-07 16:33 UTC (permalink / raw)
  To: zsh-workers

> On 07/09/2023 15:26 Jun. T <takimoto-j@kba.biglobe.ne.jp> wrote:
> > 2023/09/01 2:38, I wrote:
> > So we need to reset (or save/restore) errflag somewhere...
> 
> The patch below is a simple save/restore errflag. It seems to work,
> but I'm not sure it is the (or a) correct fix. I will not push this
> unless I get positive responses.

Looks like these are the only calls to substevalchar(), so perhaps
the changes could be made internal to that.  If the test for
errflag within substevalchar() is, as I presume, there simply because
something later in the function will fail if errflag is non-zero, that
might be the point to clear it.  Then the function can return the
error state at the end of the function just before it restores errflag.
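
As a rough illustration of that idea (a toy model, not the zsh code): the function saves the caller's flag, starts clean, returns its own error state, and restores the saved flag just before returning. All names here are hypothetical, and "value above 127" again stands in for a failed conversion:

```c
#include <stddef.h>
#include <stdlib.h>

/* Toy model only: stand-in for the global errflag. */
static int toy_errflag;

/* Stand-in for substevalchar() with the flag handled internally.
 * Returns 1 if this particular call failed, 0 otherwise. */
static int
toy_evalchar_isolated(const char *expr, char out[2])
{
    int saved = toy_errflag;    /* remember the caller's state */
    int failed;
    long n;

    toy_errflag = 0;            /* start clean for this element */
    n = strtol(expr, NULL, 0);
    if (n > 127)
        toy_errflag = 1;        /* models the failed conversion */
    out[0] = (char)(n & 0xff);
    out[1] = '\0';
    failed = toy_errflag;       /* error state of this call only */
    toy_errflag = saved;        /* restore just before returning */
    return failed;
}

/* Convert all elements; one failure no longer poisons the next call.
 * Returns the number of elements processed and sets *any_failed. */
int
toy_convert_all_isolated(const char *const *elems, size_t n, int *any_failed)
{
    char out[2];
    size_t i;

    toy_errflag = 0;
    *any_failed = 0;
    for (i = 0; i < n; i++)
        *any_failed |= toy_evalchar_isolated(elems[i], out);
    return (int)i;
}
```

With "128" and "129" both elements are now processed and the failure is still reported to the caller through *any_failed, so the caller can decide what to do with it.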

pws



* Re: (LC_ALL=C; set -x 128 129; printf "%s\n" ${(#)@} | hexdump -C)
  2023-09-07 16:33       ` Peter Stephenson
@ 2023-09-08 16:30         ` Jun. T
  2023-09-11  8:57           ` Peter Stephenson
  0 siblings, 1 reply; 10+ messages in thread
From: Jun. T @ 2023-09-08 16:30 UTC (permalink / raw)
  To: zsh-workers


> 2023/09/08 1:33, Peter Stephenson <p.w.stephenson@ntlworld.com> wrote:
> 
> Looks like these are the only calls to substevalchar(), so perhaps
> the changes could be made internal to that.

I changed paramsubst() just because we can assume errflag=0 at the
start of the block (which would make things simpler).

But, anyway, my previous patch was not complete.
Either with or without my previous patch (in any locale):

% echo ${(#X):-@}
zsh: bad math expression: illegal character: @

This is OK. But:

% printf "%s\n" ${(#):-@} | hexdump -C 
00000000  22 0a 22                                          |"."|
00000003

The quote removal is done in remnulargs() (at subst.c:169).
So it seems that if noerrs is set (without the (X) flag) then we should not
quit from prefork() at line 146. This means, I guess, substevalchar()
should not return NULL if noerrs is set. But if we want to continue
even if we have a bad math expression, the only thing we can do is just
to return "" instead of NULL. The patch below (hopefully) does this.
Any comment is welcome.


diff --git a/Src/subst.c b/Src/subst.c
index 14947ae36..d68159227 100644
--- a/Src/subst.c
+++ b/Src/subst.c
@@ -1489,11 +1489,18 @@ subst_parse_str(char **sp, int single, int err)
 static char *
 substevalchar(char *ptr)
 {
-    zlong ires = mathevali(ptr);
+    zlong ires;
     int len = 0;
+    int saved_errflag = errflag;
 
-    if (errflag)
-	return NULL;
+    errflag = 0;
+    ires = mathevali(ptr);
+
+    if (errflag) {  /* not a valid numerical expression */
+	errflag |= saved_errflag;
+	return noerrs ? dupstring(""): NULL;
+    }
+    errflag |= saved_errflag;
 #ifdef MULTIBYTE_SUPPORT
     if (isset(MULTIBYTE) && ires > 127) {
 	/* '\\' + 'U' + 8 bytes of character + '\0' */






* Re: (LC_ALL=C; set -x 128 129; printf "%s\n" ${(#)@} | hexdump -C)
  2023-09-08 16:30         ` Jun. T
@ 2023-09-11  8:57           ` Peter Stephenson
  2023-09-11 12:11             ` Jun. T
  0 siblings, 1 reply; 10+ messages in thread
From: Peter Stephenson @ 2023-09-11  8:57 UTC (permalink / raw)
  To: Jun. T, zsh-workers

> On 08/09/2023 17:30 Jun. T <takimoto-j@kba.biglobe.ne.jp> wrote:
> > 2023/09/08 1:33, Peter Stephenson <p.w.stephenson@ntlworld.com> wrote:
> > 
> > Looks like these are the only calls to substevalchar(), so perhaps
> > the changes could be made internal to that.
> 
> I changed paramsubst() just because we can assume errflag=0 at the
> start of the block (which would make things simpler).
> 
> But, anyway, my previous patch was not complete.
> Either with or without my previous patch (in any locale):
> 
> % echo ${(#X):-@}
> zsh: bad math expression: illegal character: @
> 
> This is OK. But:
> 
> % printf "%s\n" ${(#):-@} | hexdump -C 
> 00000000  22 0a 22                                          |"."|
> 00000003
> 
> The quote removal is done in remnulargs() (at subst.c:169).
> So it seems that if noerrs is set (without the (X) flag) then we should not
> quit from prefork() at line 146. This means, I guess, substevalchar()
> should not return NULL if noerrs is set. But if we want to continue
> even if we have a bad math expression, the only thing we can do is just
> to return "" instead of NULL. The patch below (hopefully) does this.
> Any comment is welcome.

I think that's fine --- it's the sort of thing where we'll only find
out if there are issues when someone comes up with a new corner case.
Thanks.

pws



* Re: (LC_ALL=C; set -x 128 129; printf "%s\n" ${(#)@} | hexdump -C)
  2023-09-11  8:57           ` Peter Stephenson
@ 2023-09-11 12:11             ` Jun. T
  2023-09-13  9:59               ` Jun T
  0 siblings, 1 reply; 10+ messages in thread
From: Jun. T @ 2023-09-11 12:11 UTC (permalink / raw)
  To: zsh-workers

Here are some tests for (#) and (#X).



diff --git a/Test/D04parameter.ztst b/Test/D04parameter.ztst
index 0d44558a7..12ae1a446 100644
--- a/Test/D04parameter.ztst
+++ b/Test/D04parameter.ztst
@@ -2785,3 +2785,43 @@ F:behavior, see http://austingroupbugs.net/view.php?id=888
 >string with spaces
 >stringwithspaces
 >stringwithspaces
+
+  : ${(#X):-@}
+1:${(#X)...}: bad math expression
+?(eval):1: bad math expression: illegal character: @
+
+  echo a${(#):-@}z
+0:${(#)...}: bad math expression
+>az
+
+  printf "a%sz\n" ${(#):-@}
+0:${(#)...}: bad math expression, printf
+>az
+
+  a=( '1 +' '@' )
+  : ${(#X)a}
+1:${(#X)...}: array of bad math expressions
+?(eval):2: bad math expression: operand expected at end of string
+
+  printf "a%sz\n" ${(#)a}
+0:${(#)...}: array of bad math expressions, printf
+>az
+
+  : ${(#X):-0x80}
+1:${(#X)...}: out-of-range character
+?(eval):1: character not in range
+
+  [[ ${(#):-0x80} = $'\x80' ]] && echo OK
+0:${(#)...}: out-of-range character
+>OK
+
+  a=( 0x80 0x81 )
+  : ${(#X)a}
+1:${(#X)...}: array of out-of-range characters
+?(eval):2: character not in range
+
+  printf "%s\n" ${(#)a} |
+  while read x; do echo $(( #x )); done
+0:${(#)...}: array of out-of-range characters
+>128
+>129





* Re: (LC_ALL=C; set -x 128 129; printf "%s\n" ${(#)@} | hexdump -C)
  2023-09-11 12:11             ` Jun. T
@ 2023-09-13  9:59               ` Jun T
  2023-09-21  9:26                 ` Jun T
  0 siblings, 1 reply; 10+ messages in thread
From: Jun T @ 2023-09-13  9:59 UTC (permalink / raw)
  To: zsh-workers


> 2023/09/11 21:11, Jun. T <takimoto-j@kba.biglobe.ne.jp> wrote:
> 
> Here are some tests for (#) and (#X).

Sorry, I've already pushed this, but the test fails on FreeBSD,
DragonFly and NetBSD for out-of-range characters.

On these OSes (with or without my patch):

% LC_ALL=C zsh -f
% echo ${(#):-0x80} | hexdump -C
00000000  3f 0a                                             |?.|
00000002

This is due to the peculiar behavior of iconv(3). It converts
an out-of-range character to '?' (0x3f) with return value 1,
indicating that one character was converted in a "non-reversible" way.

I feel this is a "bug" of iconv(), but it may be better to implement
some workaround. I'm getting rather busy now, but hopefully I can
work on this within a few days.



* Re: (LC_ALL=C; set -x 128 129; printf "%s\n" ${(#)@} | hexdump -C)
  2023-09-13  9:59               ` Jun T
@ 2023-09-21  9:26                 ` Jun T
  0 siblings, 0 replies; 10+ messages in thread
From: Jun T @ 2023-09-21  9:26 UTC (permalink / raw)
  To: zsh-workers


> 2023/09/13 18:59, Jun T <takimoto-j@kba.biglobe.ne.jp> wrote:
> 
> the test fails on FreeBSD,
> DragonFly and NetBSD for out-of-range characters.
(snip)
> This is due to the peculiar behavior of iconv(3). It converts
> out-of-range character to '?' (0x3f) with return value 1,

This behavior of iconv(3) is explicitly documented in the manpage
and we can't say it's a bug, but anyway I think we should treat
the positive return value of iconv() in the same way as -1.

But simply replacing (utils.c:7046)
	if (count == (size_t)-1) {
by
	if (count) {
didn't work because of the complication due to errflag/noerrs.
So I moved the conversion code into a new function ucs4tomb().
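
The rule "any nonzero return from iconv() is a failure" can be sketched on its own. This is a simplified illustration under stated assumptions, not the ucs4tomb() from the patch: the function name ucs4_to_codeset is hypothetical, the target character set is passed in explicitly instead of coming from nl_langinfo(CODESET), and no error message is printed:

```c
#include <iconv.h>
#include <stddef.h>

/* Sketch: convert one UCS-4 code point to the character set `codeset`
 * (an iconv encoding name).  `buf` must have room for at least 6 bytes.
 * Returns the number of bytes written, or -1 if iconv() reports an
 * error or a non-reversible conversion. */
int
ucs4_to_codeset(const char *codeset, unsigned int wval, char *buf)
{
    char inbuf[4], *inptr = inbuf, *outptr = buf;
    size_t inbytes = 4, outbytes = 6, count;
    iconv_t cd;
    int i;

    cd = iconv_open(codeset, "UCS-4BE");
    if (cd == (iconv_t)-1)
        return -1;
    /* store the value in big-endian form */
    for (i = 3; i >= 0; i--) {
        inbuf[i] = (char)(wval & 0xff);
        wval >>= 8;
    }
    count = iconv(cd, &inptr, &inbytes, &outptr, &outbytes);
    iconv_close(cd);
    /* (size_t)-1 is an error; a positive count means some characters
     * were converted non-reversibly (for example to '?'). */
    if (count != 0)
        return -1;
    return (int)(outptr - buf);
}
```

ucs4_to_codeset("US-ASCII", 'A', buf) returns 1, and ucs4_to_codeset("US-ASCII", 0x80, buf) returns -1 whether the underlying iconv() signals the problem with -1 or with a positive count.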

Two more modifications:
[1] A negative value, such as ${(#X):--1}, is now an error.

[2] If __STDC_ISO_10646__ is not defined, for example on macOS, and a
UTF-8 locale is in use, then ucs4toutf8() is used for the conversion.
This function now accepts only the range 0 - 0x7fff_ffff because
wctomb(3) on Linux (with UTF-8 locale) accepts this range (the old
range of UCS4).
# But now it seems UCS4 is equivalent to UTF-32 and limited to the
# range 0 - 0x10_ffff (and the maximum length of UTF-8 is 4 bytes).
# We can make ucs4toutf8() accept only this range, if that's better.
# This will also make $'\U110000' an error.
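
For reference, an encoder with that range limit can be sketched as follows. This is not the ucs4toutf8() in utils.c, only a compact stand-alone version of the same idea; the name ucs4_to_utf8 and the lead-byte table are assumptions made for the illustration:

```c
/* Sketch of a UCS-4 -> UTF-8 encoder accepting 0 .. 0x7fffffff (the
 * old UCS-4 range, up to 6 bytes).  `dest` must have room for 6 bytes.
 * Returns the number of bytes written, or -1 if out of range. */
int
ucs4_to_utf8(char *dest, unsigned int wval)
{
    /* lead-byte marker for each sequence length (index 0 unused) */
    static const unsigned char lead[7] =
        { 0, 0x00, 0xc0, 0xe0, 0xf0, 0xf8, 0xfc };
    int len, i;

    if (wval < 0x80)
        len = 1;
    else if (wval < 0x800)
        len = 2;
    else if (wval < 0x10000)
        len = 3;
    else if (wval < 0x200000)
        len = 4;
    else if (wval < 0x4000000)
        len = 5;
    else if (wval < 0x80000000u)
        len = 6;
    else
        return -1;

    /* fill continuation bytes from the end, 6 bits at a time */
    for (i = len - 1; i > 0; i--) {
        dest[i] = (char)((wval & 0x3f) | 0x80);
        wval >>= 6;
    }
    dest[0] = (char)(wval | lead[len]);
    return len;
}
```

Here 0x80 encodes as the two bytes c2 80 and 0x80000000 is rejected. Restricting the function to the UTF-32 range would amount to returning -1 for any value above 0x10ffff instead.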


BTW, with or without my recent patch, if the (X) flag is not
given but conversion fails, then the lowest byte of the number
is output as a single-byte character. Is this really useful?
If so, do we need to document it? Or can we just output ""?


diff --git a/Src/subst.c b/Src/subst.c
index dc2052ee0..347b1b8bd 100644
--- a/Src/subst.c
+++ b/Src/subst.c
@@ -1501,16 +1501,15 @@ substevalchar(char *ptr)
 	return noerrs ? dupstring(""): NULL;
     }
     errflag |= saved_errflag;
+    if (ires < 0) {
+	zerr("character not in range");
+    }
 #ifdef MULTIBYTE_SUPPORT
-    if (isset(MULTIBYTE) && ires > 127) {
-	/* '\\' + 'U' + 8 bytes of character + '\0' */
-	char buf[11];
-
-	/* inefficient: should separate out \U handling from getkeystring */
-	sprintf(buf, "\\U%.8x", (unsigned int)ires & 0xFFFFFFFFu);
-	ptr = getkeystring(buf, &len, GETKEYS_BINDKEY, NULL);
+    else if (isset(MULTIBYTE) && ires > 127) {
+	ptr = zhalloc(MB_CUR_MAX);
+	len = ucs4tomb((unsigned int)ires & 0xffffffff, ptr);
     }
-    if (len == 0)
+    if (len <= 0)
 #endif
     {
 	ptr = zhalloc(2);
diff --git a/Src/utils.c b/Src/utils.c
index 7040d0954..e8d2613b4 100644
--- a/Src/utils.c
+++ b/Src/utils.c
@@ -6671,12 +6671,15 @@ dquotedzputs(char const *s, FILE *stream)
 
 # if defined(HAVE_NL_LANGINFO) && defined(CODESET) && !defined(__STDC_ISO_10646__)
 /* Convert a character from UCS4 encoding to UTF-8 */
-
-static size_t
+  
+static int
 ucs4toutf8(char *dest, unsigned int wval)
 {
-    size_t len;
+    int len;
 
+    /* UCS4 is now equvalent to UTF-32 and limited to 0 - 0x10_FFFF.
+     * This function accepts 0 - 0x7FFF_FFFF (old range of UCS4) to be
+     * compatible with wctomb(3) (in UTF-8 locale) on Linux. */
     if (wval < 0x80)
       len = 1;
     else if (wval < 0x800)
@@ -6687,8 +6690,12 @@ ucs4toutf8(char *dest, unsigned int wval)
       len = 4;
     else if (wval < 0x4000000)
       len = 5;
-    else
+    else if (wval < 0x80000000)
       len = 6;
+    else {
+      zerr("character not in range");
+      return -1;
+    }
 
     switch (len) { /* falls through except to the last case */
     case 6: dest[5] = (wval & 0x3f) | 0x80; wval >>= 6;
@@ -6705,30 +6712,89 @@ ucs4toutf8(char *dest, unsigned int wval)
 }
 #endif
 
+/* Convert UCS4 to a multibyte character in current locale.
+ * Result is saved in buf (must be at least MB_CUR_MAX bytes long).
+ * Returns the number of bytes saved in buf, or -1 if conversion fails. */
 
-/*
- * The following only occurs once or twice in the code, but in different
- * places depending how character set conversion is implemented.
- */
-#define CHARSET_FAILED()		      \
-    if (how & GETKEY_DOLLAR_QUOTE) {	      \
-	while ((*tdest++ = *++s)) {	      \
-	    if (how & GETKEY_UPDATE_OFFSET) { \
-		if (s - sstart > *misc)	      \
-		    (*misc)++;		      \
-	    }				      \
-	    if (*s == Snull) {		      \
-		*len = (s - sstart) + 1;      \
-		*tdest = '\0';		      \
-		return buf;		      \
-	    }				      \
-	}				      \
-	*len = tdest - buf;		      \
-	return buf;			      \
-    }					      \
-    *t = '\0';				      \
-    *len = t - buf;			      \
-    return buf
+/**/
+int
+ucs4tomb(unsigned int wval, char *buf)
+{
+#if defined(HAVE_WCHAR_H) && defined(HAVE_WCTOMB) && defined(__STDC_ISO_10646__)
+    int count = wctomb(buf, (wchar_t)wval);
+    if (count == -1)
+	zerr("character not in range");
+    return count;
+#else	/* !(HAVE_WCHAR_H && HAVE_WCTOMB && __STDC_ISO_10646__) */
+# if defined(HAVE_NL_LANGINFO) && defined(CODESET)
+    if (!strcmp(nl_langinfo(CODESET), "UTF-8")) {
+	return ucs4toutf8(buf, wval);
+    } else {
+#   ifdef HAVE_ICONV
+	iconv_t cd;
+	char inbuf[4], *bsave = buf;
+	ICONV_CONST char *inptr = inbuf;
+	size_t inbytes = 4, outbytes = 6;
+	const char *codesetstr = nl_langinfo(CODESET);
+	size_t count;
+	int i;
+
+	/*
+	 * If the code set isn't handled, we'd better assume it's US-ASCII
+	 * rather than just failing hopelessly.  Solaris has a weird habit
+	 * of returning 646.  This is handled by the native iconv(), but
+	 * not by GNU iconv; what's more, some versions of the native iconv
+	 * don't handle standard names like ASCII.
+	 *
+	 * This should only be a problem if there's a mismatch between the
+	 * NLS and the iconv in use, which probably only means if libiconv
+	 * is in use.  We checked at configure time if our libraries pulled
+	 * in _libiconv_version, which should be a good test.
+	 *
+	 * It shouldn't ever be NULL, but while we're being paranoid...
+	 */
+#     ifdef ICONV_FROM_LIBICONV
+	if (!codesetstr || !*codesetstr)
+	    codesetstr = "US-ASCII";
+#     endif
+	cd = iconv_open(codesetstr, "UCS-4BE");
+#     ifdef ICONV_FROM_LIBICONV
+	if (cd == (iconv_t)-1 &&  !strcmp(codesetstr, "646")) {
+	    codesetstr = "US-ASCII";
+	    cd = iconv_open(codesetstr, "UCS-4BE");
+	}
+#     endif
+	if (cd == (iconv_t)-1) {
+	    zerr("cannot do charset conversion (iconv failed)");
+	    return -1;
+	}
+
+	/* store value in big endian form */
+	for (i=3; i>=0; i--) {
+	    inbuf[i] = wval & 0xff;
+	    wval >>= 8;
+	}
+	count = iconv(cd, &inptr, &inbytes, &buf, &outbytes);
+	iconv_close(cd);
+	if (count) {
+	    /* -1 indicates error. Positive value means number of "invalid"
+	     * (or "non-reversible") conversions, which we consider as
+	     * "out-of-range" characters. */
+	    zerr("character not in range");
+	    return -1;
+	}
+	return buf - bsave;
+#   else    /* !HAVE_ICONV */
+	zerr("cannot do charset conversion (iconv not available)");
+	return -1;
+#   endif   /* HAVE_ICONV */
+    }
+# else	/* !(HAVE_NL_LANGINFO && CODESET) */
+    zerr("cannot do charset conversion (NLS not supported)");
+    return -1;
+# endif	/* HAVE_NL_LANGINFO && CODESET */
+#endif	/* HAVE_WCHAR_H && HAVE_WCTOMB && __STDC_ISO_10646__ */
+}
 
 /*
  * Decode a key string, turning it into the literal characters.
@@ -6785,21 +6851,6 @@ getkeystring(char *s, int *len, int how, int *misc)
     char *t, *tdest = NULL, *u = NULL, *sstart = s, *tbuf = NULL;
     char svchar = '\0';
     int meta = 0, control = 0, ignoring = 0;
-    int i;
-#if defined(HAVE_WCHAR_H) && defined(HAVE_WCTOMB) && defined(__STDC_ISO_10646__)
-    wint_t wval;
-    int count;
-#else
-    unsigned int wval;
-# if defined(HAVE_NL_LANGINFO) && defined(CODESET)
-#  if defined(HAVE_ICONV)
-    iconv_t cd;
-    char inbuf[4];
-    size_t inbytes, outbytes;
-#  endif
-    size_t count;
-# endif
-#endif
 
     DPUTS((how & GETKEY_UPDATE_OFFSET) &&
 	  (how & ~(GETKEYS_DOLLARS_QUOTE|GETKEY_UPDATE_OFFSET)),
@@ -6864,7 +6915,8 @@ getkeystring(char *s, int *len, int how, int *misc)
     }
     for (; *s; s++) {
 	if (*s == '\\' && s[1]) {
-	    int miscadded;
+	    int miscadded, count, i;
+	    unsigned int wval;
 	    if ((how & GETKEY_UPDATE_OFFSET) && s - sstart < *misc) {
 		(*misc)--;
 		miscadded = 1;
@@ -6979,86 +7031,32 @@ getkeystring(char *s, int *len, int how, int *misc)
 		    *misc = wval;
 		    return s+1;
 		}
-#if defined(HAVE_WCHAR_H) && defined(HAVE_WCTOMB) && defined(__STDC_ISO_10646__)
-		count = wctomb(t, (wchar_t)wval);
+		count = ucs4tomb(wval, t);
 		if (count == -1) {
-		    zerr("character not in range");
-		    CHARSET_FAILED();
+		    if (how & GETKEY_DOLLAR_QUOTE) {
+			while ((*tdest++ = *++s)) {
+			    if (how & GETKEY_UPDATE_OFFSET) {
+				if (s - sstart > *misc)
+				    (*misc)++;
+			    }
+			    if (*s == Snull) {
+				*len = (s - sstart) + 1;
+				*tdest = '\0';
+				return buf;
+			    }
+			}
+			*len = tdest - buf;
+		    }
+		    else {
+			*t = '\0';
+			*len = t - buf;
+		    }
+		    return buf;
 		}
 		if ((how & GETKEY_UPDATE_OFFSET) && s - sstart < *misc)
 		    (*misc) += count;
 		t += count;
-# else
-#  if defined(HAVE_NL_LANGINFO) && defined(CODESET)
-		if (!strcmp(nl_langinfo(CODESET), "UTF-8")) {
-		    count = ucs4toutf8(t, wval);
-		    t += count;
-		    if ((how & GETKEY_UPDATE_OFFSET) && s - sstart < *misc)
-			(*misc) += count;
-		} else {
-#   ifdef HAVE_ICONV
-		    ICONV_CONST char *inptr = inbuf;
-		    const char *codesetstr = nl_langinfo(CODESET);
-    	    	    inbytes = 4;
-		    outbytes = 6;
-		    /* store value in big endian form */
-		    for (i=3;i>=0;i--) {
-			inbuf[i] = wval & 0xff;
-			wval >>= 8;
-		    }
 
-		    /*
-		     * If the code set isn't handled, we'd better
-		     * assume it's US-ASCII rather than just failing
-		     * hopelessly.  Solaris has a weird habit of
-		     * returning 646.  This is handled by the
-		     * native iconv(), but not by GNU iconv; what's
-		     * more, some versions of the native iconv don't
-		     * handle standard names like ASCII.
-		     *
-		     * This should only be a problem if there's a
-		     * mismatch between the NLS and the iconv in use,
-		     * which probably only means if libiconv is in use.
-		     * We checked at configure time if our libraries
-		     * pulled in _libiconv_version, which should be
-		     * a good test.
-		     *
-		     * It shouldn't ever be NULL, but while we're
-		     * being paranoid...
-		     */
-#ifdef ICONV_FROM_LIBICONV
-		    if (!codesetstr || !*codesetstr)
-			codesetstr = "US-ASCII";
-#endif
-    	    	    cd = iconv_open(codesetstr, "UCS-4BE");
-#ifdef ICONV_FROM_LIBICONV
-		    if (cd == (iconv_t)-1 &&  !strcmp(codesetstr, "646")) {
-			codesetstr = "US-ASCII";
-			cd = iconv_open(codesetstr, "UCS-4BE");
-		    }
-#endif
-		    if (cd == (iconv_t)-1) {
-			zerr("cannot do charset conversion (iconv failed)");
-			CHARSET_FAILED();
-		    }
-                    count = iconv(cd, &inptr, &inbytes, &t, &outbytes);
-		    iconv_close(cd);
-		    if (count == (size_t)-1) {
-                        zerr("character not in range");
-			CHARSET_FAILED();
-		    }
-		    if ((how & GETKEY_UPDATE_OFFSET) && s - sstart < *misc)
-			(*misc) += count;
-#   else
-                    zerr("cannot do charset conversion (iconv not available)");
-		    CHARSET_FAILED();
-#   endif
-		}
-#  else
-                zerr("cannot do charset conversion (NLS not supported)");
-		CHARSET_FAILED();
-#  endif
-# endif
 		if (how & GETKEY_DOLLAR_QUOTE) {
 		    char *t2;
 		    for (t2 = tbuf; t2 < t; t2++) {




