read -d $'\200' doesn't work with set +o multibyte


* read -d $'\200' doesn't work with set +o multibyte
@ 2022-12-09 15:42 Stephane Chazelas
  2022-12-09 20:05 ` Oliver Kiddle
  0 siblings, 1 reply; 10+ messages in thread
From: Stephane Chazelas @ 2022-12-09 15:42 UTC (permalink / raw)
  To: Zsh hackers list

Even in a locale with a single-byte charmap, when multibyte is
off, I can't make read -d work when the delimiter is a byte >=
0x80.

$ LC_ALL=en_GB.iso885915 ./Src/zsh +o multibyte
$ locale charmap
ISO-8859-15
$ locale ctype-mb-cur-max
1
$ print 'a\351b' | read -rd $'\351'
$ print $?
1
$ print -r -- $REPLY | LC_ALL=C od -tc
0000000   a 351   b  \n
0000004

Without set +o multibyte, the above works (0351, é in that
charset, is treated as a delimiter), but not for \200 (undefined
in ISO-8859-1, which I guess is expected).

With LC_ALL=C, on GNU systems where mbrtowc() returns -1 EILSEQ
for any byte >= 0x80, I find read -d doesn't work for byte >=
0x80 used as delimiter with or without set +o multibyte.

(on Debian GNU/Linux amd64 with 5.9 or git HEAD).

I've raised a related issue against ksh93
(https://github.com/ksh93/ksh/issues/590)

It looks like POSIX are considering specifying read -d. They
would leave it unspecified if the delimiter is neither the empty
string nor a single-byte character.
https://austingroupbugs.net/view.php?id=243#c6091

-- 
Stephane


* Re: read -d $'\200' doesn't work with set +o multibyte
  2022-12-09 15:42 read -d $'\200' doesn't work with set +o multibyte Stephane Chazelas
@ 2022-12-09 20:05 ` Oliver Kiddle
  2022-12-10  9:06   ` read -d $'\200' doesn't work with set +o multibyte (and [PATCH]) Stephane Chazelas
  0 siblings, 1 reply; 10+ messages in thread
From: Oliver Kiddle @ 2022-12-09 20:05 UTC (permalink / raw)
  To: Zsh hackers list

Stephane Chazelas wrote:
> Even in a locale with a single-byte charmap, when multibyte is
> off, I can't make read -d work when the delimiter is a byte >=
> 0x80.

In my testing, it does work in a single-byte locale. I tested on
multiple systems.

Looking at the multibyte implementation of read, the approach taken
is to use a wchar_t for the delimiter and then maintain mbstate_t for
the input. This supports a delimiter that can be any single Unicode
codepoint. In my testing this is working as intended. But note that \351
alone is incomplete in UTF-8 terms, so what wchar_t value should that be
mapped to?

Also interesting to consider is the range \x7f to \x9f in an ISO-8859-x
locale. Those are duplicates of the control characters. In my testing
with a single-byte locale \x89 as a delimiter will end input at a tab
character but the converse (\t as a delimiter) will not terminate at
\x89 in the input.

My understanding of the proposed POSIX wording is that it requires
the individual octet, regardless of any character mapping to be the
delimiter. Does anyone track the austin list? Would be good if they can
be persuaded to relax what they specify. The part I especially object to
is requiring that the input does not contain null bytes. The fact that
zsh can cope with nulls is often really useful. Why can't they leave
that unspecified? I can understand wanting to standardise a lowest
common denominator but that is punishing an existing richer
implementation.

One way forward would be to take the argument to -d as a literal and
potentially multi-byte delimiter. UTF-8 has the property that a valid
sequence can't occur within a longer sequence so for UTF-8 you would not
need to worry about it finding a delimiter within a different
character. This is not the case with combining characters but the
current implementation will also stop at the uncombined character.
There are other multi-byte encodings for which this is not true. I've
no idea how relevant things like EUC-JP and Shift-JIS still are.

A side effect of this would be support for strings of quite distinct
characters as a multi-character delimiter.

Should we document the fact that -d '' works like -d $'\0'? Perhaps mark
this as being for compatibility with other shells? Fortunately, it does
work as specified but this may only be by accident. When the -d feature
was added, it was probably only checked that the behaviour with an empty
delimiter was sane.

> $ LC_ALL=en_GB.iso885915 ./Src/zsh +o multibyte
> $ locale charmap
> ISO-8859-15

What do you get with the following? I'd sooner trust this:
  zmodload zsh/langinfo; echo $langinfo[CODESET]

Oliver


* Re: read -d $'\200' doesn't work with set +o multibyte (and [PATCH])
  2022-12-09 20:05 ` Oliver Kiddle
@ 2022-12-10  9:06   ` Stephane Chazelas
  2022-12-13 11:12     ` Jun T
  2022-12-15  2:01     ` Oliver Kiddle
  0 siblings, 2 replies; 10+ messages in thread
From: Stephane Chazelas @ 2022-12-10  9:06 UTC (permalink / raw)
  To: Oliver Kiddle; +Cc: Zsh hackers list

2022-12-09 21:05:02 +0100, Oliver Kiddle:
> Stephane Chazelas wrote:
> > Even in a locale with a single-byte charmap, when multibyte is
> > off, I can't make read -d work when the delimiter is a byte >=
> > 0x80.
> 
> In my testing, it does work in a single-byte locale. I tested on
> multiple systems.
> 
> Looking at the multibyte implementation of read, the approach taken
> is to use a wchar_t for the delimiter and then maintain mbstate_t for
> the input. This supports a delimiter that can be any single Unicode
> codepoint. In my testing this is working as intended. But note that \351
> alone is incomplete in UTF-8 terms, so what wchar_t value should that be
> mapped to?

Note that here I'm talking of the case where multibyte is
*disabled* (zsh +o multibyte), and where UTF-8 (or any other
multibyte charset) is nowhere in the picture. As I said, with
multibyte on, it works for valid characters; in iso8859-15 on
GNU systems, that's 0..0x7f, 0xa0..0xff.

IIRC, in other areas of the code, bytes that can't be decoded
into characters are decoded as 0xdc00 + byte.

$ grep -rnwi 0xdc00 .
./ChangeLog:12625:      invalid multibyte characters to 0xDC00 + index which is invalid
./Src/pattern.c:242:    ((wchar_t) (0xDC00 + STOUC(ch)))

See workers/36411 workers/36415

It would be great if something like that was done everywhere so
we can always deal with arbitrary arrays of bytes regardless of
the locale.

> Also interesting to consider is the range \x7f to \x9f in an ISO-8859-x
> locale. Those are duplicates of the control characters. In my testing
> with a single-byte locale \x89 as a delimiter will end input at a tab
> character but the converse (\t as a delimiter) will not terminate at
> \x89 in the input.

I can't reproduce here:

~$ LC_ALL=en_GB.iso885915 zsh +o multibyte -c "IFS= read -rd $'\x89' a <<< $'a\tb'; print -rn -- \$a" | hd
00000000  61 09 62 0a                                       |a.b.|
00000004
~$ LC_ALL=en_GB.iso885915 zsh -c "IFS= read -rd $'\x89' a <<< $'a\tb'; print -rn -- \$a" | hd
00000000  61 09 62 0a                                       |a.b.|
00000004
~$ LC_ALL=en_GB.UTF-8 zsh -c "IFS= read -rd $'\x89' a <<< $'a\tb'; print -rn -- \$a" | hd
00000000  61 09 62 0a                                       |a.b.|
00000004
~$ LC_ALL=en_GB.UTF-8 zsh +o multibyte -c "IFS= read -rd $'\x89' a <<< $'a\tb'; print -rn -- \$a" | hd
00000000  61 09 62 0a                                       |a.b.|
00000004

> My understanding of the proposed POSIX wording is that it requires
> the individual octet, regardless of any character mapping to be the
> delimiter. Does anyone track the austin list? Would be good if they can
> be persuaded to relax what they specify. The part I especially object to
> is requiring that the input does not contain null bytes. The fact that
> zsh can cope with nulls is often really useful. Why can't they leave
> that unspecified? I can understand wanting to standardise a lowest
> common denominator but that is punishing an existing richer
> implementation.

Not sure what you mean. The proposed text has:

 -d delim

    If delim consists of one single-byte character, that byte
    shall be used as the logical line delimiter. If delim is the
    null string, the logical line delimiter shall be the null
    byte. Otherwise, the behavior is unspecified.

That's added alongside xargs -0 and find's -print0 to be able to
deal with arbitrary file names, so the point is for it to work
on input with NULs.

The:

 If the -d delim option is specified and delim consists of one
 single-byte character other than <newline>, the standard input
 shall contain zero or more characters, shall not contain any
 null bytes, and (if not empty) shall end with delim.

Is a requirement on the *application*, not the implementation.
That is, it only specifies what's meant to happen when the input
doesn't contain NULs.

So I think we're good here.

I'm subscribed to both austin-group-l and zsh-workers but don't
follow them very closely. I try to mention things relevant to
zsh here when I spot them on austin-group-l and I try to argue
there about things that would conflict with the zsh way for no
good reason.

austin-group-l is not high volume; I would recommend at least
one zsh developer get in there. I see the maintainers of FreeBSD
sh, NetBSD sh, mksh, bash at least occasionally contributing
there. You can also get an account on their bug tracker. I've
got one and I'm not the maintainer of any software relevant to
POSIX.

Changes in the bug tracker are posted to the ML. It's often
preferable to add a comment on a ticket than post on the ML.

> One way forward would be to take the argument to -d as a literal and
> potentially multi-byte delimiter. UTF-8 has the property that a valid
> sequence can't occur within a longer sequence so for UTF-8 you would not
> need to worry about it finding a delimiter within a different
> character. This is not the case with combining characters but the
> current implementation will also stop at the uncombined character.
> There are other multi-byte encodings for which this is not true. I've
> no idea how relevant things like EUC-JP and Shift-JIS still are.

Things like Shift-JIS are unworkable. I don't expect anyone to
still be using them.

GB18030 and BIG5/BIG5-HKSCS may still be relevant. They don't
work on Shift state like Shift-JIS, but many of their characters
have bytes <= 0x7f, and zsh doesn't really work with them for
that reason.

$ echo αε | iconv -t BIG5-HKSCS | hd
00000000  a3 5c a3 60 0a                                    |.\.`.|
00000005

Simply *having* locales with those charsets opens your system up
to security vulnerabilities as you have alpha characters which
contain \ and `, special to the shell and many other things.

In practice not many things work with them; it's not just zsh.
I've noticed newer Debian/Ubuntu doesn't offer locales with them
any longer (though you can still generate some if you like).

[...]
> Should we document the fact that -d '' works like -d $'\0'? Perhaps mark
> this as being for compatibility with other shells? Fortunately, it does
> work as specified but this may only be by accident. When the -d feature
> was added, it was probably only checked that the behaviour with an empty
> delimiter was sane.

Yes, I agree it's worth documenting. AFAIK, read -d is from
ksh93. read -d '' likely works in bash (added there in 2.04) by
accident as well (first byte of a NUL-delimited string), it
didn't in ksh93.

IFS= read -rd '' is a well known coding pattern in bash.
read -d '' now works in ksh93u+m and mksh.

-d is likely used much more often with '' than with anything
else.

> > $ LC_ALL=en_GB.iso885915 ./Src/zsh +o multibyte
> > $ locale charmap
> > ISO-8859-15
> 
> What do you get with the following? I'd sooner trust this:
>   zmodload zsh/langinfo; echo $langinfo[CODESET]

Same ISO-8859-15

$ LC_ALL=en_GB.iso885915  ./Src/zsh +o multibyte -c "IFS= read -rd $'\351' a <<< a$'\351'b; print -rn -- \$a" | hd
00000000  61 e9 62 0a                                       |a.b.|
00000004

gdb under LC_ALL=en_GB.iso885915 luit

6402        if (OPT_ISSET(ops,'d')) {
(gdb)
6403            char *delimstr = OPT_ARG(ops,'d');
(gdb)
6407            if (isset(MULTIBYTE)) {
(gdb)
6412                wi = WEOF;
(gdb)
6413            if (wi != WEOF)
(gdb)
6416                delim = (wchar_t)((delimstr[0] == Meta) ?
(gdb)
6417                                  delimstr[1] ^ 32 : delimstr[0]);
(gdb)
6416                delim = (wchar_t)((delimstr[0] == Meta) ?
(gdb)
6421            if (SHTTY != -1) {
(gdb) p delim
$1 = -23 L'\xffffffe9'
(gdb) p delimstr
$2 = 0x7ffff7fa1790 "é"

(I guess because delimstr is a char * with signed chars, rather than an
unsigned char *).

It works better after:

diff --git a/Src/builtin.c b/Src/builtin.c
index a7b7755a7..d650ca750 100644
--- a/Src/builtin.c
+++ b/Src/builtin.c
@@ -6414,9 +6414,9 @@ bin_read(char *name, char **args, Options ops, UNUSED(int func))
 	    delim = (wchar_t)wi;
 	else
 	    delim = (wchar_t)((delimstr[0] == Meta) ?
-			      delimstr[1] ^ 32 : delimstr[0]);
+			      STOUC(delimstr[1]) ^ 32 : STOUC(delimstr[0]));
 #else
-        delim = (delimstr[0] == Meta) ? delimstr[1] ^ 32 : delimstr[0];
+        delim = (delimstr[0] == Meta) ? STOUC(delimstr[1]) ^ 32 : STOUC(delimstr[0]);
 #endif
 	if (SHTTY != -1) {
 	    struct ttyinfo ti;

(I don't know if it's the proper way to cast, my C is rusty)

Including for bytes that don't map to any character in ISO8859-15:

$ LC_ALL=en_GB.iso885915 zsh -c "IFS= read -rd $'\x80' a <<< $'a\x80b'; print -rn -- \$a" | hd
00000000  61                                                |a|
00000001

So I guess that's the fix for my bug.

-- 
Stephane


* Re: read -d $'\200' doesn't work with set +o multibyte (and [PATCH])
  2022-12-10  9:06   ` read -d $'\200' doesn't work with set +o multibyte (and [PATCH]) Stephane Chazelas
@ 2022-12-13 11:12     ` Jun T
  2022-12-14 21:42       ` Oliver Kiddle
  2022-12-15  2:01     ` Oliver Kiddle
  1 sibling, 1 reply; 10+ messages in thread
From: Jun T @ 2022-12-13 11:12 UTC (permalink / raw)
  To: zsh-workers



> 2022/12/10 18:06, Stephane Chazelas <stephane@chazelas.org> wrote:

> It works better after:
> 
> diff --git a/Src/builtin.c b/Src/builtin.c
> index a7b7755a7..d650ca750 100644
> --- a/Src/builtin.c
> +++ b/Src/builtin.c
> @@ -6414,9 +6414,9 @@ bin_read(char *name, char **args, Options ops, UNUSED(int func))
> 	    delim = (wchar_t)wi;
> 	else
> 	    delim = (wchar_t)((delimstr[0] == Meta) ?
> -			      delimstr[1] ^ 32 : delimstr[0]);
> +			      STOUC(delimstr[1]) ^ 32 : STOUC(delimstr[0]));
> #else
> -        delim = (delimstr[0] == Meta) ? delimstr[1] ^ 32 : delimstr[0];
> +        delim = (delimstr[0] == Meta) ? STOUC(delimstr[1]) ^ 32 : STOUC(delimstr[0]);
> #endif
> 	if (SHTTY != -1) {
> 	    struct ttyinfo ti;
> 
(snip)
> So I guess that's the fix for my bug.

Thanks, I think it fixes the problem for the '#ifdef MULTIBYTE_SUPPORT' section.

When MULTIBYTE_SUPPORT is not defined, delim is a char, so we need
STOUC() not when assigning to delim but when using delim.
But instead of adding STOUC() to every use of delim (in the section
where MULTIBYTE_SUPPORT is not defined), it would be easier to define
delim as int.

A simple test is added (it only tests the C locale with multibyte option on).


diff --git a/Src/builtin.c b/Src/builtin.c
index a7b7755a7..a6fadb622 100644
--- a/Src/builtin.c
+++ b/Src/builtin.c
@@ -6286,7 +6286,7 @@ bin_read(char *name, char **args, Options ops, UNUSED(int func))
     char *laststart;
     size_t ret;
 #else
-    char delim = '\n';
+    int delim = '\n';
 #endif
 
     if (OPT_HASARG(ops,c='k')) {
@@ -6413,10 +6413,10 @@ bin_read(char *name, char **args, Options ops, UNUSED(int func))
 	if (wi != WEOF)
 	    delim = (wchar_t)wi;
 	else
-	    delim = (wchar_t)((delimstr[0] == Meta) ?
+	    delim = (wchar_t)STOUC((delimstr[0] == Meta) ?
 			      delimstr[1] ^ 32 : delimstr[0]);
 #else
-        delim = (delimstr[0] == Meta) ? delimstr[1] ^ 32 : delimstr[0];
+        delim = STOUC((delimstr[0] == Meta) ? delimstr[1] ^ 32 : delimstr[0]);
 #endif
 	if (SHTTY != -1) {
 	    struct ttyinfo ti;
diff --git a/Test/B04read.ztst b/Test/B04read.ztst
index 25c3d4173..a2f03c9b3 100644
--- a/Test/B04read.ztst
+++ b/Test/B04read.ztst
@@ -82,6 +82,12 @@
 >Testing the
 >null hypothesis
 
+ print -n $'first line\x80second line\x80' |
+ while read -d $'\x80' line; do print $line; done
+0:read with a delimeter >= 0x80
+>first line
+>second line
+
 # Note that trailing NULLs are not stripped even if they are in
 # $IFS; only whitespace characters contained in $IFS are stripped.
  print -n $'Aaargh, I hate nulls.\0\0\0' | read line






* Re: read -d $'\200' doesn't work with set +o multibyte (and [PATCH])
  2022-12-13 11:12     ` Jun T
@ 2022-12-14 21:42       ` Oliver Kiddle
  2022-12-15 12:37         ` Jun. T
  0 siblings, 1 reply; 10+ messages in thread
From: Oliver Kiddle @ 2022-12-14 21:42 UTC (permalink / raw)
  To: Jun T; +Cc: zsh-workers

Jun T wrote:
> Thanks, I think it fixes the problem for the '#ifdef MULTIBYTE_SUPPORT' section.
>
> When MULTIBYTE_SUPPORT is not defined, delim is char, so we need
> STOUC() not when assigning to delim but when using delim.
> But instead of adding STOUC() to every use of delim (in nondef
> MULTIBYTE_SUPPORT section), it would be easier to define delim as int.

At least in my testing, it appears to also work to define delim as
unsigned char which I would find less confusing.

> + print -n $'first line\x80second line\x80' |
> + while read -d $'\x80' line; do print $line; done
> +0:read with a delimeter >= 0x80

There's a typo in "delimiter"

The patch below needs to be applied on top of your patch. It adds a few
more test cases, documents (and tests) the empty string being an
alternative way to set the delimiter to NUL. It also addresses the
additional problem I was hitting when trying to reproduce the original
problem. Rather than follow the 0xdc00 + byte suggestion it was
easier to simply set a separate flag variable and follow the
!isset(MULTIBYTE) path through the later code.

Oliver

diff --git a/Doc/Zsh/builtins.yo b/Doc/Zsh/builtins.yo
index b6217f66d..56428a714 100644
--- a/Doc/Zsh/builtins.yo
+++ b/Doc/Zsh/builtins.yo
@@ -1589,7 +1589,8 @@ Input is read from the coprocess.
 )
 item(tt(-d) var(delim))(
 Input is terminated by the first character of var(delim) instead of
-by newline.
+by newline.  For compatibility with other shells, if var(delim) is an
+empty string, input is terminated at the first NUL.
 )
 item(tt(-t) [ var(num) ])(
 Test if input is available before attempting to read.  If var(num)
diff --git a/Src/builtin.c b/Src/builtin.c
index a6fadb622..09d0ca2f0 100644
--- a/Src/builtin.c
+++ b/Src/builtin.c
@@ -6282,6 +6282,7 @@ bin_read(char *name, char **args, Options ops, UNUSED(int func))
     long izle_timeout = 0;
 #ifdef MULTIBYTE_SUPPORT
     wchar_t delim = L'\n', wc;
+    int rawbyte = 0;
     mbstate_t mbs;
     char *laststart;
     size_t ret;
@@ -6412,9 +6413,11 @@ bin_read(char *name, char **args, Options ops, UNUSED(int func))
 	    wi = WEOF;
 	if (wi != WEOF)
 	    delim = (wchar_t)wi;
-	else
+	else {
 	    delim = (wchar_t)STOUC((delimstr[0] == Meta) ?
 			      delimstr[1] ^ 32 : delimstr[0]);
+	    rawbyte = 1;
+	}
 #else
         delim = STOUC((delimstr[0] == Meta) ? delimstr[1] ^ 32 : delimstr[0]);
 #endif
@@ -6841,7 +6844,7 @@ bin_read(char *name, char **args, Options ops, UNUSED(int func))
 		break;
 	    }
 	    *bptr = (char)c;
-	    if (isset(MULTIBYTE)) {
+	    if (isset(MULTIBYTE) && !rawbyte) {
 		ret = mbrtowc(&wc, bptr, 1, &mbs);
 		if (!ret)	/* NULL */
 		    ret = 1;
diff --git a/Test/B04read.ztst b/Test/B04read.ztst
index a2f03c9b3..f50c43682 100644
--- a/Test/B04read.ztst
+++ b/Test/B04read.ztst
@@ -82,6 +82,10 @@
 >Testing the
 >null hypothesis
 
+ read -ed '' <<<$'one\0two'
+0:empty delimiter terminates at nulls
+>one
+
  print -n $'first line\x80second line\x80' |
  while read -d $'\x80' line; do print $line; done
 0:read with a delimeter >= 0x80
diff --git a/Test/D07multibyte.ztst b/Test/D07multibyte.ztst
index 6909346cb..413c4fe73 100644
--- a/Test/D07multibyte.ztst
+++ b/Test/D07multibyte.ztst
@@ -212,6 +212,20 @@
 >first
 >second
 
+  read -ed £
+0:read with multibyte delimiter where bytes of delimiter also occur in input
+<one¤twoãthree£four
+>one¤twoãthree
+
+  read -ed $'\xa0' <<<$'first\xa0second'
+0:read delimited by a byte that isn't a valid multibyte character
+>first
+
+  read -ed $'\xc2'
+0:read delimited by a single byte terminates if the byte is part of a multibyte character
+<one£two
+>one
+
   (IFS=«
   read -d » -A array
   print -l $array)



* Re: read -d $'\200' doesn't work with set +o multibyte (and [PATCH])
  2022-12-10  9:06   ` read -d $'\200' doesn't work with set +o multibyte (and [PATCH]) Stephane Chazelas
  2022-12-13 11:12     ` Jun T
@ 2022-12-15  2:01     ` Oliver Kiddle
  1 sibling, 0 replies; 10+ messages in thread
From: Oliver Kiddle @ 2022-12-15  2:01 UTC (permalink / raw)
  To: Zsh hackers list

On 10 Dec, Stephane Chazelas wrote:
> Is a requirement on the *application*, not the implementation.
> That is, it only specifies what's meant to happen when the input
> doesn't contain NULs.
>
> So I think we're good here.

Ok, good. I'm afraid I'm not very good at interpreting the specific
language used in those standards.

> I'm subscribed to both austin-group-l and zsh-workers but don't
> follow them very closely. I try to mention things relevant to
> zsh here when I spot them on austin-group-l and I try to argue
> there about things that would conflict with the zsh way for no
> good reason.

Thanks for passing such things on. I did subscribe some time ago but
time is limited and it didn't spark joy so I unsubscribed. I still have
a login for the site at least.

> GB18030 and BIG5/BIG5-HKSCS may still be relevant. They don't
> work on Shift state like Shift-JIS, but many of their characters
> have bytes <= 0x7f, and zsh doesn't really work with them for
> that reason.

Thanks for the answer.

Oliver



* Re: read -d $'\200' doesn't work with set +o multibyte (and [PATCH])
  2022-12-14 21:42       ` Oliver Kiddle
@ 2022-12-15 12:37         ` Jun. T
  2022-12-16  8:29           ` Oliver Kiddle
  0 siblings, 1 reply; 10+ messages in thread
From: Jun. T @ 2022-12-15 12:37 UTC (permalink / raw)
  To: zsh-workers


> 2022/12/15 6:42, Oliver Kiddle <opk@zsh.org> wrote:
> 
> At least in my testing, it appears to also work to define delim as
> unsigned char which I would find less confusing.

I used int since delim is always compared with 'c' (=int).
But of course it's OK to use unsigned char.

>> +0:read with a delimeter >= 0x80

> 
> There's a typo in "delimiter"

Thanks.
# I was thinking 'delimeter' was the _correct_ spelling ;)

> The patch below needs to be applied on top of your patch.

Could you push all the patches?

But:
> It adds a few
> more test cases,
(snip)
> additional problem I was hitting when trying to reproduce the original
> problem. ...

Sorry, I missed your first post, and I didn't consider the real
problem with multibyte locales.

> --- a/Test/D07multibyte.ztst
> +++ b/Test/D07multibyte.ztst
(snip)
> +  read -ed $'\xc2'
> +0:read delimited by a single byte terminates if the byte is part of a multibyte character
> +<one£two
> +>one

Is this really what the standard requires (or will require)?
Breaking in the middle of a valid multibyte character looks
rather odd to me.


* Re: read -d $'\200' doesn't work with set +o multibyte (and [PATCH])
  2022-12-15 12:37         ` Jun. T
@ 2022-12-16  8:29           ` Oliver Kiddle
  2022-12-18 10:51             ` Jun. T
  0 siblings, 1 reply; 10+ messages in thread
From: Oliver Kiddle @ 2022-12-16  8:29 UTC (permalink / raw)
  To: Jun. T; +Cc: zsh-workers

"Jun. T" wrote:
> > --- a/Test/D07multibyte.ztst
> > +++ b/Test/D07multibyte.ztst
> (snip)
> > +  read -ed $'\xc2'
> > +0:read delimited by a single byte terminates if the byte is part of a multibyte character
> > +<one£two
> > +>one
>
> Is this really what the standard requires (or will require)?
> Breaking in the middle of a valid multibyte character looks
> rather odd to me.

The proposed standard wording appears to only talk about the case of the
delimiter consisting of "one single-byte character". $'\xc2' is not a
valid UTF-8 character so my interpretation is that they are leaving this
undefined.

Behaviour that treats the input as raw bytes for a raw byte delimiter
is consistent. This retains compatibility with the way things
work for a non-multibyte locale. Not all files are valid UTF-8 and it
can be useful to force things to work at a raw byte level.

The only alternative I can think of would be to print an error for the
delimiter. Did you have something else in mind?

Oliver


* Re: read -d $'\200' doesn't work with set +o multibyte (and [PATCH])
  2022-12-16  8:29           ` Oliver Kiddle
@ 2022-12-18 10:51             ` Jun. T
  2022-12-18 17:58               ` Stephane Chazelas
  0 siblings, 1 reply; 10+ messages in thread
From: Jun. T @ 2022-12-18 10:51 UTC (permalink / raw)
  To: zsh-workers


> 2022/12/16 17:29, Oliver Kiddle <opk@zsh.org> wrote:
> 
>>> +  read -ed $'\xc2'
>>> +0:read delimited by a single byte terminates if the byte is part of a multibyte character
>>> +<one£two
>>> +>one
>> 
>> Is this really what the standard requires (or will require)?
>> Breaking in the middle of a valid multibyte character looks
>> rather odd to me.
> 
> The proposed standard wording appears to only talk about the case of the
> delimiter consisting of "one single-byte character". $'\xc2' is not a
> valid UTF-8 character so my interpretation is that they are leaving this
> undefined.

I thought the "one single-byte character" etc. applies only when C or
POSIX locale is in use.

> Behaviour that treats the input as raw bytes for a raw byte delimiter
> is consistent. This retains compatibility with the way things
> work for a non-multibyte locale. Not all files are valid UTF-8 and it
> can be useful to force things to work at a raw byte level.

I was thinking it would be enough if we can do 'byte-by-byte' analysis by
using C/POSIX locale (or by setting MULTIBYTE option to off).

In the web page Stephane mentioned:
https://austingroupbugs.net/view.php?id=243#c6091

"When the current locale is not the C or POSIX locale, pathnames can contain bytes that do not form part of a valid character, and therefore portable applications need to ensure that the current locale is the C or POSIX locale when using read with arbitrary pathnames as input."

But I'm not familiar with this type of document.




* Re: read -d $'\200' doesn't work with set +o multibyte (and [PATCH])
  2022-12-18 10:51             ` Jun. T
@ 2022-12-18 17:58               ` Stephane Chazelas
  0 siblings, 0 replies; 10+ messages in thread
From: Stephane Chazelas @ 2022-12-18 17:58 UTC (permalink / raw)
  To: Jun. T; +Cc: zsh-workers

2022-12-18 19:51:22 +0900, Jun. T:
[...]
> In the web page Stephane mentioned:
> https://austingroupbugs.net/view.php?id=243#c6091
> 
> "When the current locale is not the C or POSIX locale,
> pathnames can contain bytes that do not form part of a valid
> character, and therefore portable applications need to ensure
> that the current locale is the C or POSIX locale when using
> read with arbitrary pathnames as input."
> 
> But I'm not familiar with this type of document.
[...]

In POSIX terminology, "application" identifies users of the
API: here, sh/read users, that is scripts or other code that use
the "read" utility. "Implementation" identifies a compliant
supplier of the API, so a sh implementation such as zsh in sh
emulation.

Here, it says that applications (scripts) wanting to use "read"
to read a file path into a variable should use:

IFS= LC_ALL=C read -rd '' file

I argued there in comments below
(https://austingroupbugs.net/view.php?id=243#c6093,
https://austingroupbugs.net/view.php?id=243#c6095) that with
IFS= read -d '', read implementations in practice don't seem to
need LC_ALL=C to do the right thing.

-- 
Stephane


