From: Oliver Kiddle <opk@zsh.org>
To: Zsh hackers list <zsh-workers@zsh.org>
Subject: Re: read -d $'\200' doesn't work with set +o multibyte
Date: Fri, 09 Dec 2022 21:05:02 +0100
Message-ID: <99492-1670616302.663548@1brw.o7tP.wgJL>
In-Reply-To: <20221209154225.2z3lbtf422ypnmjx@chazelas.org>

Stephane Chazelas wrote:
> Even in a locale with a single-byte charmap, when multibyte is
> off, I can't make read -d work when the delimiter is a byte >=
> 0x80.

In my testing, it does work in a single-byte locale. I tested on
multiple systems.

Looking at the multibyte implementation of read, the approach taken
is to use a wchar_t for the delimiter and then maintain mbstate_t for
the input. This supports a delimiter that can be any single Unicode
codepoint. In my testing this is working as intended. But note that \351
alone is incomplete in UTF-8 terms, so what wchar_t value should that be
mapped to?
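
To make the \351 point concrete, here is a small sketch. It assumes an
iconv(1) that knows UTF-8 is available on the PATH, and the helper name
is_valid_utf8 is made up for this illustration. The byte \351 is é in
ISO-8859-1 and ISO-8859-15, but in UTF-8 it is only the lead byte of a
three-byte sequence, so on its own it does not decode to any codepoint.

```shell
# is_valid_utf8 STRING: succeed if STRING decodes cleanly as UTF-8.
# Hypothetical helper; assumes iconv exits non-zero on invalid or
# incomplete input.
is_valid_utf8() {
  printf '%s' "$1" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

lone=$(printf '\351')        # single byte 0xE9
full=$(printf '\303\251')    # 0xC3 0xA9, the UTF-8 encoding of U+00E9

is_valid_utf8 "$lone" || printf '%s\n' 'lone \351: not valid UTF-8'
is_valid_utf8 "$full" && printf '%s\n' '\303\251: valid UTF-8'
```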

Also interesting to consider is the range \x7f to \x9f in an ISO-8859-x
locale. Those are duplicates of the control characters. In my testing
with a single-byte locale, \x89 as a delimiter will end input at a tab
character, but the converse (\t as a delimiter) will not terminate at
\x89 in the input.

My understanding of the proposed POSIX wording is that it requires
the individual octet, regardless of any character mapping, to be the
delimiter. Does anyone track the austin list? It would be good if they
could be persuaded to relax what they specify. The part I especially object to
is requiring that the input does not contain null bytes. The fact that
zsh can cope with nulls is often really useful. Why can't they leave
that unspecified? I can understand wanting to standardise a lowest
common denominator but that is punishing an existing richer
implementation.

One way forward would be to take the argument to -d as a literal and
potentially multi-byte delimiter. UTF-8 has the property that a valid
sequence can't occur within a longer sequence so for UTF-8 you would not
need to worry about it finding a delimiter within a different
character. This is not the case with combining characters but the
current implementation will also stop at the uncombined character.
There are other multi-byte encodings for which this is not true. I've
no idea how relevant things like EUC-JP and Shift-JIS still are.
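
As a rough illustration of the difference, the sketch below searches
for a delimiter as a raw byte sequence. The helper name has_bytes and
the choice of test character are mine, not anything from the zsh
source. It uses KATAKANA LETTER SO, whose Shift-JIS encoding is
0x83 0x5C and so ends in the same byte as an ASCII backslash, while
its UTF-8 encoding contains no byte below 0x80.

```shell
# has_bytes HAYSTACK NEEDLE: succeed if NEEDLE occurs in HAYSTACK as a
# raw byte sequence. Hypothetical helper; LC_ALL=C forces byte-wise
# matching regardless of the caller's locale.
has_bytes() {
  local LC_ALL=C
  case $1 in
    *"$2"*) return 0 ;;
    *)      return 1 ;;
  esac
}

delim=$(printf '\134')           # the byte 0x5C (backslash in ASCII)
sjis=$(printf '\203\134')        # KATAKANA SO in Shift-JIS: 0x83 0x5C
utf8=$(printf '\343\202\275')    # same character in UTF-8: 0xE3 0x82 0xBD

has_bytes "$sjis" "$delim" && printf '%s\n' 'Shift-JIS: delimiter byte found inside a character'
has_bytes "$utf8" "$delim" || printf '%s\n' 'UTF-8: no false match'
```

A byte-wise search for a literal delimiter is therefore safe on UTF-8
input, but in an encoding like Shift-JIS it would need to track
character boundaries.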

A side effect of this would be support for strings of quite distinct
characters as a multi-character delimiter.

Should we document the fact that -d '' works like -d $'\0'? Perhaps mark
this as being for compatibility with other shells? Fortunately, it does
work as specified but this may only be by accident. When the -d feature
was added, it was probably only checked that the behaviour with an empty
delimiter was sane.
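
For reference, a minimal sketch of the usage in question, assuming a
shell whose read accepts -d (zsh or bash). The function name
count_nul_records is made up for the example.

```shell
# Count NUL-terminated records on stdin.
# Hypothetical helper; relies on `read -d ''` treating the empty
# delimiter as NUL, as discussed above.
count_nul_records() {
  local n=0 rec
  while IFS= read -r -d '' rec; do
    n=$((n + 1))
  done
  printf '%s\n' "$n"
}

# The second record contains a newline, which is not a delimiter here.
printf 'one\0two\nlines\0three\0' | count_nul_records
```

This is the same idiom commonly paired with find -print0 in other
shells, which is the compatibility case mentioned above.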

> $ LC_ALL=en_GB.iso885915 ./Src/zsh +o multibyte
> $ locale charmap
> ISO-8859-15

What do you get with the following? I'd sooner trust this:
  zmodload zsh/langinfo; echo $langinfo[CODESET]


