zsh-workers
 help / color / mirror / code / Atom feed
* [bug] busyloop upon $=var with NULs when $IFS contains both NUL and a byte > 0x7f
@ 2022-11-18 14:27 Stephane Chazelas
  2022-11-29 14:27 ` Jun. T
  2022-11-30 14:56 ` Jun. T
  0 siblings, 2 replies; 14+ messages in thread
From: Stephane Chazelas @ 2022-11-18 14:27 UTC (permalink / raw)
  To: Zsh hackers list

$ LC_ALL=C zsh -c 'IFS=é$IFS; echo $=IFS'
^C

(busy loop had to be interrupted with ^C).

Call trace:

(gdb) bt
#0  0x00005555556249ad in mb_metacharlenconv (s=0x555555675185 "\203 ", wcp=0x7fffffffc760) at utils.c:5541
#1  0x0000555555622791 in itype_end (ptr=0x555555675185 "\203 ", itype=32, once=1) at utils.c:4332
#2  0x000055555562121e in wordcount (s=0x555555675185 "\203 ", sep=0x0, mul=-1) at utils.c:3835
#3  0x0000555555620b29 in spacesplit (s=0x555555675180 "\303\251 \t\n\203 ", allownull=0, heap=1, quote=0) at utils.c:3650
#4  0x0000555555621495 in sepsplit (s=0x555555675180 "\303\251 \t\n\203 ", sep=0x0, allownull=0, heap=1) at utils.c:3908
#5  0x0000555555612f55 in paramsubst (l=0x7ffff7fbf560, n=0x7ffff7fbf590, str=0x7fffffffce00, qt=0, pf_flags=0, ret_flags=0x7fffffffcfcc) at subst.c:3660
#6  0x000055555560bb37 in stringsubst (list=0x7ffff7fbf560, node=0x7ffff7fbf590, pf_flags=0, ret_flags=0x7fffffffcfcc, asssub=0) at subst.c:322
#7  0x000055555560abbc in prefork (list=0x7ffff7fbf560, flags=0, ret_flags=0x7fffffffcfcc) at subst.c:142
#8  0x0000555555595cfd in execcmd_exec (state=0x7fffffffd940, eparams=0x7fffffffd540, input=0, output=0, how=18, last1=1, close_if_forked=-1) at exec.c:3232
#9  0x0000555555592757 in execpline2 (state=0x7fffffffd940, pcode=131, how=18, input=0, output=0, last1=1) at exec.c:1966
#10 0x0000555555590f30 in execpline (state=0x7fffffffd940, slcode=4098, how=18, last1=1) at exec.c:1691
#11 0x000055555559009a in execlist (state=0x7fffffffd940, dont_change_job=0, exiting=1) at exec.c:1444
#12 0x000055555558f735 in execode (p=0x7ffff7fbf448, dont_change_job=0, exiting=1, context=0x55555562e108 "cmdarg") at exec.c:1221
#13 0x000055555558f60c in execstring (s=0x7fffffffdedc "IFS=\303\251$IFS; echo $=IFS", dont_change_job=0, exiting=1, context=0x55555562e108 "cmdarg")
    at exec.c:1187
#14 0x00005555555ba5f9 in init_misc (cmd=0x7fffffffdedc "IFS=\303\251$IFS; echo $=IFS", zsh_name=0x7fffffffded5 "zsh") at init.c:1389
#15 0x00005555555bbd2d in zsh_main (argc=3, argv=0x7fffffffdb58) at init.c:1780
#16 0x000055555556ad29 in main (argc=3, argv=0x7fffffffdb58) at ./main.c:93


With +o multibyte, no busy loop, but splitting doesn't work properly:

$ LC_ALL=C zsh +o multibyte -c 'IFS=é$IFS; printf "<%q>\n" $=IFS'
<$'\303'$'\251'>
<''>

That's triggered when IFS contains both NUL and a byte over 0x7f
(in any order) and when the variable to split contains NUL.

In UTF-8 locales, that's triggered when IFS contains NUL and
bytes or byte sequences not forming parts of valid characters.

"read" doesn't seem to be affected:

$ print 'foo\0bar' | LC_ALL=C zsh -c 'IFS=é$IFS read -rA a; typeset a'
a=( foo bar )

(that's on Debian GNU/Linux amd64 with zsh git HEAD)

-- 
Stephane



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [bug] busyloop upon $=var with NULs when $IFS contains both NUL and a byte > 0x7f
  2022-11-18 14:27 [bug] busyloop upon $=var with NULs when $IFS contains both NUL and a byte > 0x7f Stephane Chazelas
@ 2022-11-29 14:27 ` Jun. T
  2022-11-29 14:38   ` Peter Stephenson
                     ` (2 more replies)
  2022-11-30 14:56 ` Jun. T
  1 sibling, 3 replies; 14+ messages in thread
From: Jun. T @ 2022-11-29 14:27 UTC (permalink / raw)
  To: zsh-workers


> 2022/11/18 23:27, Stephane Chazelas <stephane@chazelas.org> wrote:
> 
> $ LC_ALL=C zsh -c 'IFS=é$IFS; echo $=IFS'
> ^C
> 
> (busy loop had to be interrupted with ^C).

This not simple to solve. The basic question is:
   What should we do if IFS contains invalid characters?

When IFS changes, ifssetfn() calls inittyptab(), and it then calls
set_widearay() (at line 4172 in utils.c) to set the structure
ifs_wide. The origin of the problem seems to be in this function
(also in utils.c):

  95             mblen = mb_metacharlenconv(mb_array, &wci);
  ..
  99             /* No good unless all characters are convertible */
 100             if (wci == WEOF)
 101                 return;

mb_array is the current IFS (metafied), and it contains
é = \xc3\xa9. In the C locale (and at least on Linux), \xc3 is
an invalid character, and wci is set to WEOF. Then the function
returns without setting ifs_wide (ifs_wide.chars=NULL and
ifs_wide.len=0).

The comment at line 99 may look reasonable, but leaving ifs_wide
empty is equally 'no good', I think.

Due to this empty ifs_wide, itype_end() (and wcsitype()) doesn't
work as expected (for character >= \x80).

The 'busy loop' is in wordcount() (utils.c):

3834         for (; *s; r++) {                                                 
3835             char *ie = itype_end(s, ISEP, 1);               
3836             if (ie != s) {                                             
3837                 s = ie;                                                
....                          
3840             }                                          
3841             (void)findsep(&s, NULL, 0);
....
3845         }

Here, the pointer s already points to a ISEP (\x83\x20 = metafied Nul),
but itype_end() can't find the next ISEP (ie == s) due to the empty
ifs_wide, and findsep() does not move s because *s is already ISEP,
resulting in infinite-loop with the same s.

So the basic question is:
What should we do if IFS contains invalid character(s)?

I think, at least if MULTIBYTE option is ON, it would be better to
force reset IFS to the default, rather than leaving ifs_wide empty.

Or store only valid characters in ifs_side.chars?

BTW, in set_widearay():

  89             if (STOUC(*mb_array) <= 0x7f) {
  90                 mb_array++;
  91                 *wcptr++ = (wchar_t)*mb_array;

I think the lines 90,91 should be
	*wcptr++ = (wchar_t)*mb_array++;
But fixing this does not solve the current problem.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [bug] busyloop upon $=var with NULs when $IFS contains both NUL and a byte > 0x7f
  2022-11-29 14:27 ` Jun. T
@ 2022-11-29 14:38   ` Peter Stephenson
  2022-11-30  4:20     ` Bart Schaefer
  2022-12-13  9:49     ` Jun T
  2022-12-11 19:12   ` Stephane Chazelas
  2022-12-13  9:51   ` Jun T
  2 siblings, 2 replies; 14+ messages in thread
From: Peter Stephenson @ 2022-11-29 14:38 UTC (permalink / raw)
  To: zsh-workers

> On 29/11/2022 14:27 Jun. T <takimoto-j@kba.biglobe.ne.jp> wrote:
> 3834         for (; *s; r++) {                                                 
> 3835             char *ie = itype_end(s, ISEP, 1);               
> 3836             if (ie != s) {                                             
> 3837                 s = ie;                                                
> ....                          
> 3840             }                                          
> 3841             (void)findsep(&s, NULL, 0);
> ....
> 3845         }
> 
> Here, the pointer s already points to a ISEP (\x83\x20 = metafied Nul),
> but itype_end() can't find the next ISEP (ie == s) due to the empty
> ifs_wide, and findsep() does not move s because *s is already ISEP,
> resulting in infinite-loop with the same s.

I guess the obvious thing to do here is any time s doesn't move
give up splitting at that point.

pws


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [bug] busyloop upon $=var with NULs when $IFS contains both NUL and a byte > 0x7f
  2022-11-29 14:38   ` Peter Stephenson
@ 2022-11-30  4:20     ` Bart Schaefer
  2022-11-30  9:21       ` Peter Stephenson
  2022-12-13  9:49     ` Jun T
  1 sibling, 1 reply; 14+ messages in thread
From: Bart Schaefer @ 2022-11-30  4:20 UTC (permalink / raw)
  To: zsh-workers

On Tue, Nov 29, 2022 at 6:41 AM Peter Stephenson
<p.w.stephenson@ntlworld.com> wrote:
>
> I guess the obvious thing to do here is any time s doesn't move
> give up splitting at that point.

I reported the infinite loop back in workers/50472 ("out of memory
error" thread), although the cause doesn't appear to be exactly the
same.

I suppose if s doesn't move we could forcibly advance to see if
another possible split occurs later.  Unless there's clearly a way to
determine that the lack of movement is due to upstream user error
(such as invalid IFS value in this case)?


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [bug] busyloop upon $=var with NULs when $IFS contains both NUL and a byte > 0x7f
  2022-11-30  4:20     ` Bart Schaefer
@ 2022-11-30  9:21       ` Peter Stephenson
  2022-12-13  9:50         ` Jun T
  0 siblings, 1 reply; 14+ messages in thread
From: Peter Stephenson @ 2022-11-30  9:21 UTC (permalink / raw)
  To: zsh-workers

> On 30/11/2022 04:20 Bart Schaefer <schaefer@brasslantern.com> wrote
> On Tue, Nov 29, 2022 at 6:41 AM Peter Stephenson
> <p.w.stephenson@ntlworld.com> wrote:
> >
> > I guess the obvious thing to do here is any time s doesn't move
> > give up splitting at that point.
> 
> I reported the infinite loop back in workers/50472 ("out of memory
> error" thread), although the cause doesn't appear to be exactly the
> same.
> 
> I suppose if s doesn't move we could forcibly advance to see if
> another possible split occurs later.  Unless there's clearly a way to
> determine that the lack of movement is due to upstream user error
> (such as invalid IFS value in this case)?

Engaging finger-in-the-air mode suggests in a case like this just trying
to be safe is the best we've got, unless we have proof positive we can
do something better in some likely scenario --- Jun may have a feel
for whether there's a potential good (as opposed to not completely
broken) outcome of the case he's looking at, in which case we can be
clever, otherwise I don't think it's worth it.

pws


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [bug] busyloop upon $=var with NULs when $IFS contains both NUL and a byte > 0x7f
  2022-11-18 14:27 [bug] busyloop upon $=var with NULs when $IFS contains both NUL and a byte > 0x7f Stephane Chazelas
  2022-11-29 14:27 ` Jun. T
@ 2022-11-30 14:56 ` Jun. T
  1 sibling, 0 replies; 14+ messages in thread
From: Jun. T @ 2022-11-30 14:56 UTC (permalink / raw)
  To: zsh-workers


> 2022/11/18 23:27, Stephane Chazelas <stephane@chazelas.org> wrote:
> 
> With +o multibyte, no busy loop, but splitting doesn't work properly:
> 
> $ LC_ALL=C zsh +o multibyte -c 'IFS=é$IFS; printf "<%q>\n" $=IFS'
> <$'\303'$'\251'>
> <''>

It seems this can be fixed by the following patch
(use the multibyte code only if MULTIBYTE option is on).

The test script above gives
<''>
<''>
<''>
<''>

I gess this is the expected result (the description of IFS in man
zshparam(1) is not easy to understand).

If this works OK, then I think we can force reset IFS if an invalid
character is found in it when multibyte option is on, because
if a user wants (in C locale) to include any byte in IFS then she/he
can unset multibyte option.



diff --git a/Src/utils.c b/Src/utils.c
index edf5d3df7..a182553e7 100644
--- a/Src/utils.c
+++ b/Src/utils.c
@@ -74,9 +74,6 @@ set_widearray(char *mb_array, Widechar_array wca)
     }
     wca->len = 0;
 
-    if (!isset(MULTIBYTE))
-	return;
-
     if (mb_array) {
 	VARARR(wchar_t, tmpwcs, strlen(mb_array));
 	wchar_t *wcptr = tmpwcs;
@@ -4118,8 +4115,9 @@ inittyptab(void)
      * having IIDENT here is a good idea at all, but this code
      * should disappear into history...
      */
-    for (t0 = 0240; t0 != 0400; t0++)
-	typtab[t0] = IALPHA | IALNUM | IIDENT | IUSER | IWORD;
+    if isset(MULTIBYTE)
+	for (t0 = 0240; t0 != 0400; t0++)
+	    typtab[t0] = IALPHA | IALNUM | IIDENT | IUSER | IWORD;
 #endif
     /* typtab['.'] |= IIDENT; */ /* Allow '.' in variable names - broken */
     typtab['_'] = IIDENT | IUSER;
@@ -4138,7 +4136,7 @@ inittyptab(void)
 	DEFAULT_IFS_SH : DEFAULT_IFS; *s; s++) {
 	int c = STOUC(*s == Meta ? *++s ^ 32 : *s);
 #ifdef MULTIBYTE_SUPPORT
-	if (!isascii(c)) {
+	if (isset(MULTIBYTE) && !isascii(c)) {
 	    /* see comment for wordchars below */
 	    continue;
 	}
@@ -4154,7 +4152,7 @@ inittyptab(void)
     for (s = wordchars ? wordchars : DEFAULT_WORDCHARS; *s; s++) {
 	int c = STOUC(*s == Meta ? *++s ^ 32 : *s);
 #ifdef MULTIBYTE_SUPPORT
-	if (!isascii(c)) {
+	if (isset(MULTIBYTE) && !isascii(c)) {
 	    /*
 	     * If we have support for multibyte characters, we don't
 	     * handle non-ASCII characters here; instead, we turn
@@ -4168,9 +4166,11 @@ inittyptab(void)
 	typtab[c] |= IWORD;
     }
 #ifdef MULTIBYTE_SUPPORT
-    set_widearray(wordchars, &wordchars_wide);
-    set_widearray(ifs ? ifs : EMULATION(EMULATE_KSH|EMULATE_SH) ?
-	DEFAULT_IFS_SH : DEFAULT_IFS, &ifs_wide);
+    if (isset(MULTIBYTE)) {
+	set_widearray(wordchars, &wordchars_wide);
+	set_widearray(ifs ? ifs : EMULATION(EMULATE_KSH|EMULATE_SH) ?
+	    DEFAULT_IFS_SH : DEFAULT_IFS, &ifs_wide);
+    }
 #endif
     for (s = SPECCHARS; *s; s++)
 	typtab[STOUC(*s)] |= ISPECIAL;



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [bug] busyloop upon $=var with NULs when $IFS contains both NUL and a byte > 0x7f
  2022-11-29 14:27 ` Jun. T
  2022-11-29 14:38   ` Peter Stephenson
@ 2022-12-11 19:12   ` Stephane Chazelas
  2022-12-13  9:51   ` Jun T
  2 siblings, 0 replies; 14+ messages in thread
From: Stephane Chazelas @ 2022-12-11 19:12 UTC (permalink / raw)
  To: Jun. T; +Cc: zsh-workers

2022-11-29 23:27:10 +0900, Jun. T:
[...]
> This not simple to solve. The basic question is:
>    What should we do if IFS contains invalid characters?
[...]

As also mentioned in the read -d $'\200' thread, in some other
parts of the code, bytes that can't be decoded into characters
in the locale's charmap are decoded as a 0xdc00 + byte_value.
(see See workers/36411 workers/36415)

So here, in a UTF-8 locale, IFS=$'\200' could be decoded as a
0xdc80 wchar_t and split on 0xdc80 wchar_t's only in the input
(0x80 byte values not found to be part of a character).

-- 
Stephane


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [bug] busyloop upon $=var with NULs when $IFS contains both NUL and a byte > 0x7f
  2022-11-29 14:38   ` Peter Stephenson
  2022-11-30  4:20     ` Bart Schaefer
@ 2022-12-13  9:49     ` Jun T
  2022-12-13 10:13       ` Peter Stephenson
  1 sibling, 1 reply; 14+ messages in thread
From: Jun T @ 2022-12-13  9:49 UTC (permalink / raw)
  To: zsh-workers

Sorry for very late reply.

> 2022/11/29 23:38, Peter Stephenson <p.w.stephenson@ntlworld.com> wrote:
> 
>> On 29/11/2022 14:27 Jun. T <takimoto-j@kba.biglobe.ne.jp> wrote:
>> 3834         for (; *s; r++) {                                                 
>> 3835             char *ie = itype_end(s, ISEP, 1);               
>> 3836             if (ie != s) {                                             
>> 3837                 s = ie;                                                
>> ....                          
>> 3840             }                                          
>> 3841             (void)findsep(&s, NULL, 0);
>> ....
>> 3845         }
>> 
>> Here, the pointer s already points to a ISEP (\x83\x20 = metafied Nul),
>> but itype_end() can't find the next ISEP (ie == s) due to the empty
>> ifs_wide, and findsep() does not move s because *s is already ISEP,
>> resulting in infinite-loop with the same s.
> 
> I guess the obvious thing to do here is any time s doesn't move
> give up splitting at that point.

How can we 'give up'?  Is it OK to call exit()?



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [bug] busyloop upon $=var with NULs when $IFS contains both NUL and a byte > 0x7f
  2022-11-30  9:21       ` Peter Stephenson
@ 2022-12-13  9:50         ` Jun T
  0 siblings, 0 replies; 14+ messages in thread
From: Jun T @ 2022-12-13  9:50 UTC (permalink / raw)
  To: zsh-workers


> 2022/11/30 18:21, Peter Stephenson <p.w.stephenson@ntlworld.com> wrote:
> 
>> On 30/11/2022 04:20 Bart Schaefer <schaefer@brasslantern.com> wrote
>> On Tue, Nov 29, 2022 at 6:41 AM Peter Stephenson
>> <p.w.stephenson@ntlworld.com> wrote:
>>> 
>>> I guess the obvious thing to do here is any time s doesn't move
>>> give up splitting at that point.
>> 
>> I reported the infinite loop back in workers/50472 ("out of memory
>> error" thread), although the cause doesn't appear to be exactly the
>> same.

I hope the problem (workers/50472) has been fixed by workers/50851⁩
(commit f8d9388).

>> I suppose if s doesn't move we could forcibly advance to see if
>> another possible split occurs later.  Unless there's clearly a way to
>> determine that the lack of movement is due to upstream user error
>> (such as invalid IFS value in this case)?

That would be better, but:

> Engaging finger-in-the-air mode suggests in a case like this just trying
> to be safe is the best we've got, unless we have proof positive we can
> do something better in some likely scenario

And, if we modify wordcount() then we need to modify sepsplit() and
spacesplit() (and others?). That may be doable, but I want to leave
it for future (for someone else?).

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [bug] busyloop upon $=var with NULs when $IFS contains both NUL and a byte > 0x7f
  2022-11-29 14:27 ` Jun. T
  2022-11-29 14:38   ` Peter Stephenson
  2022-12-11 19:12   ` Stephane Chazelas
@ 2022-12-13  9:51   ` Jun T
  2 siblings, 0 replies; 14+ messages in thread
From: Jun T @ 2022-12-13  9:51 UTC (permalink / raw)
  To: zsh-workers


> 2022/11/29 23:27, Jun. T <takimoto-j@kba.biglobe.ne.jp> wrote:
> 
> So the basic question is:
> What should we do if IFS contains invalid character(s)?
> 
> I think, at least if MULTIBYTE option is ON, it would be better to
> force reset IFS to the default, rather than leaving ifs_wide empty.

So currently this is the only simple solution I can think of.

In the patch below, if MULTIBYTE option is ON and IFS contains
invalid characters, it is reset to the default. Is this OK?

Do we need to issue a warning when reseting IFS?


The patch includes the patch in workers/51087⁩ (fix the behavior
when MULTIBYTE option is OFF).

When LC_CTYPE changes (directly or via LC_ALL or LANG), a
character that was valid would become invalid in the new locale.
So I added inittyptab() in lcsetfn() etc.

A simple test is included. On macOS, with C-locale, any byte
is a valid character, and IFS is not reset by the test.


 Doc/Zsh/params.yo      |  7 +++++--
 Src/params.c           |  3 +++
 Src/utils.c            | 42 ++++++++++++++++++++++++++----------------
 Test/D04parameter.ztst | 12 ++++++++++++
 4 files changed, 46 insertions(+), 18 deletions(-)

diff --git a/Doc/Zsh/params.yo b/Doc/Zsh/params.yo
index 2a30085a8..91201616a 100644
--- a/Doc/Zsh/params.yo
+++ b/Doc/Zsh/params.yo
@@ -1251,15 +1251,18 @@ Internal field separators (by default space, tab, newline and NUL), that
 are used to separate words which result from
 command or parameter expansion and words read by
 the tt(read) builtin.  Any characters from the set space, tab and
-newline that appear in the IFS are called em(IFS white space).
+newline that appear in the tt(IFS) are called em(IFS white space).
 One or more IFS white space characters or one non-IFS white space
 character together with any adjacent IFS white space character delimit
 a field.  If an IFS white space character appears twice consecutively
-in the IFS, this character is treated as if it were not an IFS white
+in the tt(IFS), this character is treated as if it were not an IFS white
 space character.
 
 If the parameter is unset, the default is used.  Note this has
 a different effect from setting the parameter to an empty string.
+
+If tt(MULTIBYTE) option is on and tt(IFS) contains invalid characters in
+the current locale, it is reset to the default.
 )
 vindex(KEYBOARD_HACK)
 item(tt(KEYBOARD_HACK))(
diff --git a/Src/params.c b/Src/params.c
index f1fe38955..81f0e5015 100644
--- a/Src/params.c
+++ b/Src/params.c
@@ -4639,6 +4639,7 @@ setlang(char *x)
 	if ((x = getsparam_u(ln->name)) && *x)
 	    setlocale(ln->category, x);
     unqueue_signals();
+    inittyptab();
 }
 
 /**/
@@ -4662,6 +4663,7 @@ lc_allsetfn(Param pm, char *x)
     else {
 	setlocale(LC_ALL, unmeta(x));
 	clear_mbstate();
+	inittyptab();
     }
 }
 
@@ -4700,6 +4702,7 @@ lcsetfn(Param pm, char *x)
     }
     unqueue_signals();
     clear_mbstate();	/* LC_CTYPE may have changed */
+    inittyptab();
 }
 #endif /* USE_LOCALE */
 
diff --git a/Src/utils.c b/Src/utils.c
index edf5d3df7..a874851cc 100644
--- a/Src/utils.c
+++ b/Src/utils.c
@@ -74,9 +74,6 @@ set_widearray(char *mb_array, Widechar_array wca)
     }
     wca->len = 0;
 
-    if (!isset(MULTIBYTE))
-	return;
-
     if (mb_array) {
 	VARARR(wchar_t, tmpwcs, strlen(mb_array));
 	wchar_t *wcptr = tmpwcs;
@@ -87,8 +84,7 @@ set_widearray(char *mb_array, Widechar_array wca)
 	    int mblen;
 
 	    if (STOUC(*mb_array) <= 0x7f) {
-		mb_array++;
-		*wcptr++ = (wchar_t)*mb_array;
+		*wcptr++ = (wchar_t)*mb_array++;
 		continue;
 	    }
 
@@ -4118,8 +4114,9 @@ inittyptab(void)
      * having IIDENT here is a good idea at all, but this code
      * should disappear into history...
      */
-    for (t0 = 0240; t0 != 0400; t0++)
-	typtab[t0] = IALPHA | IALNUM | IIDENT | IUSER | IWORD;
+    if isset(MULTIBYTE)
+	for (t0 = 0240; t0 != 0400; t0++)
+	    typtab[t0] = IALPHA | IALNUM | IIDENT | IUSER | IWORD;
 #endif
     /* typtab['.'] |= IIDENT; */ /* Allow '.' in variable names - broken */
     typtab['_'] = IIDENT | IUSER;
@@ -4134,11 +4131,24 @@ inittyptab(void)
 	typtab[t0] |= ITOK | IMETA;
     for (t0 = (int)STOUC(Snull); t0 <= (int)STOUC(Nularg); t0++)
 	typtab[t0] |= ITOK | IMETA | INULL;
-    for (s = ifs ? ifs : EMULATION(EMULATE_KSH|EMULATE_SH) ?
-	DEFAULT_IFS_SH : DEFAULT_IFS; *s; s++) {
+    /* ifs */
+#define CURRENT_DEFAULT_IFS (EMULATION(EMULATE_KSH|EMULATE_SH) ? \
+			    DEFAULT_IFS_SH : DEFAULT_IFS)
+#ifdef MULTIBYTE_SUPPORT
+    if (isset(MULTIBYTE)) {
+	set_widearray(ifs ? ifs : CURRENT_DEFAULT_IFS, &ifs_wide);
+	if (ifs && !ifs_wide.chars) {
+	    /* IFS has invalid character(s). Reset it to default */
+	    zsfree(ifs);
+	    ifs = ztrdup(CURRENT_DEFAULT_IFS);
+	    set_widearray(ifs, &ifs_wide);
+	}
+    }
+#endif
+    for (s = ifs ? ifs : CURRENT_DEFAULT_IFS; *s; s++) {
 	int c = STOUC(*s == Meta ? *++s ^ 32 : *s);
 #ifdef MULTIBYTE_SUPPORT
-	if (!isascii(c)) {
+	if (isset(MULTIBYTE) && !isascii(c)) {
 	    /* see comment for wordchars below */
 	    continue;
 	}
@@ -4151,10 +4161,15 @@ inittyptab(void)
 	}
 	typtab[c] |= ISEP;
     }
+    /* wordchars */
+#ifdef MULTIBYTE_SUPPORT
+    if (isset(MULTIBYTE))
+	set_widearray(wordchars, &wordchars_wide);
+#endif
     for (s = wordchars ? wordchars : DEFAULT_WORDCHARS; *s; s++) {
 	int c = STOUC(*s == Meta ? *++s ^ 32 : *s);
 #ifdef MULTIBYTE_SUPPORT
-	if (!isascii(c)) {
+	if (isset(MULTIBYTE) && !isascii(c)) {
 	    /*
 	     * If we have support for multibyte characters, we don't
 	     * handle non-ASCII characters here; instead, we turn
@@ -4167,11 +4182,6 @@ inittyptab(void)
 #endif
 	typtab[c] |= IWORD;
     }
-#ifdef MULTIBYTE_SUPPORT
-    set_widearray(wordchars, &wordchars_wide);
-    set_widearray(ifs ? ifs : EMULATION(EMULATE_KSH|EMULATE_SH) ?
-	DEFAULT_IFS_SH : DEFAULT_IFS, &ifs_wide);
-#endif
     for (s = SPECCHARS; *s; s++)
 	typtab[STOUC(*s)] |= ISPECIAL;
     if (typtab_flags & ZTF_SP_COMMA)
diff --git a/Test/D04parameter.ztst b/Test/D04parameter.ztst
index 6bf55b4db..d9f81f66d 100644
--- a/Test/D04parameter.ztst
+++ b/Test/D04parameter.ztst
@@ -2275,6 +2275,18 @@ F:We do not care what $OLDPWD is, as long as it does not cause an error
 F:As of this writing, var=$@ and var="$@" with null IFS have unspecified
 F:behavior, see http://austingroupbugs.net/view.php?id=888
 
+  (
+  IFS=$'\x80'
+  if [[ $IFS = $' \t\n\0' ]]; then
+    echo OK     # if $'\x80' is illegal
+  else          # otherwise, it should work as a separator
+    s=$'foo\x80\bar'
+    [[ ${${=s}[1]} = foo ]] && echo OK
+  fi
+  )
+0:reset IFS to default if it contains illegal character
+>OK
+
   () {
     setopt localoptions extendedglob
     [[ $- = [[:alnum:]]## ]] || print Failed 1





^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [bug] busyloop upon $=var with NULs when $IFS contains both NUL and a byte > 0x7f
  2022-12-13  9:49     ` Jun T
@ 2022-12-13 10:13       ` Peter Stephenson
  2022-12-13 11:40         ` Jun T
  0 siblings, 1 reply; 14+ messages in thread
From: Peter Stephenson @ 2022-12-13 10:13 UTC (permalink / raw)
  To: zsh-workers

> On 13/12/2022 09:49 Jun T <takimoto-j@kba.biglobe.ne.jp> wrote:
> > 2022/11/29 23:38, Peter Stephenson <p.w.stephenson@ntlworld.com> wrote:
> > 
> >> On 29/11/2022 14:27 Jun. T <takimoto-j@kba.biglobe.ne.jp> wrote:
> >> 3834         for (; *s; r++) {                                                 
> >> 3835             char *ie = itype_end(s, ISEP, 1);               
> >> 3836             if (ie != s) {                                             
> >> 3837                 s = ie;                                                
> >> ....                          
> >> 3840             }                                          
> >> 3841             (void)findsep(&s, NULL, 0);
> >> ....
> >> 3845         }
> >> 
> >> Here, the pointer s already points to a ISEP (\x83\x20 = metafied Nul),
> >> but itype_end() can't find the next ISEP (ie == s) due to the empty
> >> ifs_wide, and findsep() does not move s because *s is already ISEP,
> >> resulting in infinite-loop with the same s.
> > 
> > I guess the obvious thing to do here is any time s doesn't move
> > give up splitting at that point.
> 
> How can we 'give up'?  Is it OK to call exit()?

By giving up, I meant we stop splitting at that point and leave the
rest of the string intact.  If splitting fails, stop splitting...

pws


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [bug] busyloop upon $=var with NULs when $IFS contains both NUL and a byte > 0x7f
  2022-12-13 10:13       ` Peter Stephenson
@ 2022-12-13 11:40         ` Jun T
  2022-12-13 11:55           ` Peter Stephenson
  0 siblings, 1 reply; 14+ messages in thread
From: Jun T @ 2022-12-13 11:40 UTC (permalink / raw)
  To: zsh-workers


> 2022/12/13 19:13, Peter Stephenson <p.w.stephenson@ntlworld.com> wrote:
> 
>> On 13/12/2022 09:49 Jun T <takimoto-j@kba.biglobe.ne.jp> wrote:
>> 
>> How can we 'give up'?  Is it OK to call exit()?
> 
> By giving up, I meant we stop splitting at that point and leave the
> rest of the string intact.  If splitting fails, stop splitting...

If we modify wordcount() in this way, then we also need to modify
at least sepsplit() and spacesplit()?


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [bug] busyloop upon $=var with NULs when $IFS contains both NUL and a byte > 0x7f
  2022-12-13 11:40         ` Jun T
@ 2022-12-13 11:55           ` Peter Stephenson
  2023-06-21  4:49             ` Jun T
  0 siblings, 1 reply; 14+ messages in thread
From: Peter Stephenson @ 2022-12-13 11:55 UTC (permalink / raw)
  To: zsh-workers

> On 13/12/2022 11:40 Jun T <takimoto-j@kba.biglobe.ne.jp> wrote:
> 
>  
> > 2022/12/13 19:13, Peter Stephenson <p.w.stephenson@ntlworld.com> wrote:
> > 
> >> On 13/12/2022 09:49 Jun T <takimoto-j@kba.biglobe.ne.jp> wrote:
> >> 
> >> How can we 'give up'?  Is it OK to call exit()?
> > 
> > By giving up, I meant we stop splitting at that point and leave the
> > rest of the string intact.  If splitting fails, stop splitting...
> 
> If we modify wordcount() in this way, then we also need to modify
> at least sepsplit() and spacesplit()?

Right, this would have to be a general policy.

I'm not necessarily hooked on this way of doing it; it seems to
me it's a little bit less surprising than resetting IFS but I
suppose errors are always surprising.

pws


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [bug] busyloop upon $=var with NULs when $IFS contains both NUL and a byte > 0x7f
  2022-12-13 11:55           ` Peter Stephenson
@ 2023-06-21  4:49             ` Jun T
  0 siblings, 0 replies; 14+ messages in thread
From: Jun T @ 2023-06-21  4:49 UTC (permalink / raw)
  To: zsh-workers

Sorry, I've forgotten to finish this.

The patch below is the same as 51205⁩ (Dec.13, 2022), except:
zwarn() is added when resetting IFS.
a test is added to confirm that IFS can contain any bytes
if multibyte option is off.


diff --git a/Doc/Zsh/params.yo b/Doc/Zsh/params.yo
index 57d10b8bd..e0410d673 100644
--- a/Doc/Zsh/params.yo
+++ b/Doc/Zsh/params.yo
@@ -1325,15 +1325,18 @@ Internal field separators (by default space, tab, newline and NUL), that
 are used to separate words which result from
 command or parameter expansion and words read by
 the tt(read) builtin.  Any characters from the set space, tab and
-newline that appear in the IFS are called em(IFS white space).
+newline that appear in the tt(IFS) are called em(IFS white space).
 One or more IFS white space characters or one non-IFS white space
 character together with any adjacent IFS white space character delimit
 a field.  If an IFS white space character appears twice consecutively
-in the IFS, this character is treated as if it were not an IFS white
+in the tt(IFS), this character is treated as if it were not an IFS white
 space character.
 
 If the parameter is unset, the default is used.  Note this has
 a different effect from setting the parameter to an empty string.
+
+If tt(MULTIBYTE) option is on and tt(IFS) contains invalid characters in
+the current locale, it is reset to the default.
 )
 vindex(KEYBOARD_HACK)
 item(tt(KEYBOARD_HACK))(
diff --git a/Src/params.c b/Src/params.c
index 021d341e8..e25c1286c 100644
--- a/Src/params.c
+++ b/Src/params.c
@@ -4723,6 +4723,7 @@ setlang(char *x)
 	if ((x = getsparam_u(ln->name)) && *x)
 	    setlocale(ln->category, x);
     unqueue_signals();
+    inittyptab();
 }
 
 /**/
@@ -4746,6 +4747,7 @@ lc_allsetfn(Param pm, char *x)
     else {
 	setlocale(LC_ALL, unmeta(x));
 	clear_mbstate();
+	inittyptab();
     }
 }
 
@@ -4784,6 +4786,7 @@ lcsetfn(Param pm, char *x)
     }
     unqueue_signals();
     clear_mbstate();	/* LC_CTYPE may have changed */
+    inittyptab();
 }
 #endif /* USE_LOCALE */
 
diff --git a/Src/utils.c b/Src/utils.c
index f13e3a79d..58b4f7149 100644
--- a/Src/utils.c
+++ b/Src/utils.c
@@ -74,9 +74,6 @@ set_widearray(char *mb_array, Widechar_array wca)
     }
     wca->len = 0;
 
-    if (!isset(MULTIBYTE))
-	return;
-
     if (mb_array) {
 	VARARR(wchar_t, tmpwcs, strlen(mb_array));
 	wchar_t *wcptr = tmpwcs;
@@ -87,8 +84,7 @@ set_widearray(char *mb_array, Widechar_array wca)
 	    int mblen;
 
 	    if ((unsigned char) *mb_array <= 0x7f) {
-		mb_array++;
-		*wcptr++ = (wchar_t)*mb_array;
+		*wcptr++ = (wchar_t)*mb_array++;
 		continue;
 	    }
 
@@ -4121,8 +4117,9 @@ inittyptab(void)
      * having IIDENT here is a good idea at all, but this code
      * should disappear into history...
      */
-    for (t0 = 0240; t0 != 0400; t0++)
-	typtab[t0] = IALPHA | IALNUM | IIDENT | IUSER | IWORD;
+    if isset(MULTIBYTE)
+	for (t0 = 0240; t0 != 0400; t0++)
+	    typtab[t0] = IALPHA | IALNUM | IIDENT | IUSER | IWORD;
 #endif
     /* typtab['.'] |= IIDENT; */ /* Allow '.' in variable names - broken */
     typtab['_'] = IIDENT | IUSER;
@@ -4137,11 +4134,24 @@ inittyptab(void)
 	typtab[t0] |= ITOK | IMETA;
     for (t0 = (int) (unsigned char) Snull; t0 <= (int) (unsigned char) Nularg; t0++)
 	typtab[t0] |= ITOK | IMETA | INULL;
-    for (s = ifs ? ifs : EMULATION(EMULATE_KSH|EMULATE_SH) ?
-	DEFAULT_IFS_SH : DEFAULT_IFS; *s; s++) {
+     /* ifs */
+#define CURRENT_DEFAULT_IFS (EMULATION(EMULATE_KSH|EMULATE_SH) ? \
+			    DEFAULT_IFS_SH : DEFAULT_IFS)
+#ifdef MULTIBYTE_SUPPORT
+    if (isset(MULTIBYTE)) {
+	set_widearray(ifs ? ifs : CURRENT_DEFAULT_IFS, &ifs_wide);
+	if (ifs && !ifs_wide.chars) {
+	    zwarn("IFS has an invalid character; resetting IFS to default");
+	    zsfree(ifs);
+	    ifs = ztrdup(CURRENT_DEFAULT_IFS);
+	    set_widearray(ifs, &ifs_wide);
+	}
+    }
+#endif
+     for (s = ifs ? ifs : CURRENT_DEFAULT_IFS; *s; s++) {
 	int c = (unsigned char) (*s == Meta ? *++s ^ 32 : *s);
 #ifdef MULTIBYTE_SUPPORT
-	if (!isascii(c)) {
+ 	if (isset(MULTIBYTE) && !isascii(c)) {
 	    /* see comment for wordchars below */
 	    continue;
 	}
@@ -4154,10 +4164,15 @@ inittyptab(void)
 	}
 	typtab[c] |= ISEP;
     }
+    /* wordchars */
+#ifdef MULTIBYTE_SUPPORT
+    if (isset(MULTIBYTE))
+	set_widearray(wordchars, &wordchars_wide);
+#endif
     for (s = wordchars ? wordchars : DEFAULT_WORDCHARS; *s; s++) {
 	int c = (unsigned char) (*s == Meta ? *++s ^ 32 : *s);
 #ifdef MULTIBYTE_SUPPORT
-	if (!isascii(c)) {
+	if (isset(MULTIBYTE) && !isascii(c)) {
 	    /*
 	     * If we have support for multibyte characters, we don't
 	     * handle non-ASCII characters here; instead, we turn
@@ -4170,11 +4185,6 @@ inittyptab(void)
 #endif
 	typtab[c] |= IWORD;
     }
-#ifdef MULTIBYTE_SUPPORT
-    set_widearray(wordchars, &wordchars_wide);
-    set_widearray(ifs ? ifs : EMULATION(EMULATE_KSH|EMULATE_SH) ?
-	DEFAULT_IFS_SH : DEFAULT_IFS, &ifs_wide);
-#endif
     for (s = SPECCHARS; *s; s++)
 	typtab[(unsigned char) *s] |= ISPECIAL;
     if (typtab_flags & ZTF_SP_COMMA)
diff --git a/Test/D04parameter.ztst b/Test/D04parameter.ztst
index 2fd2f975f..0d44558a7 100644
--- a/Test/D04parameter.ztst
+++ b/Test/D04parameter.ztst
@@ -2280,6 +2280,27 @@ F:We do not care what $OLDPWD is, as long as it does not cause an error
 F:As of this writing, var=$@ and var="$@" with null IFS have unspecified
 F:behavior, see http://austingroupbugs.net/view.php?id=888
 
+  (
+  IFS=$'\x80'
+  if [[ $IFS = $' \t\n\0' ]]; then
+    echo OK     # if $'\x80' is illegal (e.g. Linux)
+  else          # otherwise (e.g. macOS), it should work as a separator
+    s=$'foo\x80\bar'
+    [[ ${${=s}[1]} = foo ]] && echo OK
+  fi
+  )
+0D:reset IFS to default if it contains illegal character
+>OK
+
+  (
+  unsetopt multibyte
+  IFS=$'\xc3\xa9'
+  s=$'foo\xc3bar\xa9boo'
+  echo ${${=s}[2]}
+  )
+0:eight bit chars in IFS should work if multibute option is off
+>bar
+
   () {
     setopt localoptions extendedglob
     [[ $- = [[:alnum:]]## ]] || print Failed 1




^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2023-06-21  4:50 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-11-18 14:27 [bug] busyloop upon $=var with NULs when $IFS contains both NUL and a byte > 0x7f Stephane Chazelas
2022-11-29 14:27 ` Jun. T
2022-11-29 14:38   ` Peter Stephenson
2022-11-30  4:20     ` Bart Schaefer
2022-11-30  9:21       ` Peter Stephenson
2022-12-13  9:50         ` Jun T
2022-12-13  9:49     ` Jun T
2022-12-13 10:13       ` Peter Stephenson
2022-12-13 11:40         ` Jun T
2022-12-13 11:55           ` Peter Stephenson
2023-06-21  4:49             ` Jun T
2022-12-11 19:12   ` Stephane Chazelas
2022-12-13  9:51   ` Jun T
2022-11-30 14:56 ` Jun. T

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/zsh/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).