From: Stephane Chazelas
To: Oliver Kiddle
Cc: Zsh hackers list
Date: Sat, 10 Dec 2022 09:06:26 +0000
Subject: Re: read -d $'\200' doesn't work with set +o multibyte (and [PATCH])
Message-ID: <20221210090626.mkv7bxeqnap6awah@chazelas.org>
In-Reply-To: <99492-1670616302.663548@1brw.o7tP.wgJL>
References: <20221209154225.2z3lbtf422ypnmjx@chazelas.org> <99492-1670616302.663548@1brw.o7tP.wgJL>
X-Seq: 51165

2022-12-09 21:05:02 +0100, Oliver Kiddle:
> Stephane Chazelas wrote:
> > Even in a locale with a single-byte charmap, when multibyte is
> > off, I can't make read -d work when the delimiter is a byte >=
> > 0x80.
>
> In my testing, it does work in a single-byte locale. I tested on
> multiple systems.
>
> Looking at the multibyte implementation of read, the approach taken
> is to use a wchar_t for the delimiter and then maintain mbstate_t for
> the input. This supports a delimiter that can be any single unicode
> codepoint. In my testing this is working as intended. But note that \351
> alone is incomplete in UTF-8 terms so what wchar_t value should that be
> mapped to.

Note that here I'm talking of the case where multibyte is
*disabled* (zsh +o multibyte), and where UTF-8 (or any other
multibyte charset) is nowhere in the picture.

As I said, with multibyte on, it works for valid characters; in
iso8859-15 on GNU systems, that's 0..0x7f, 0xa0..0xff.

IIRC In other areas of the code, bytes that can't be decoded
into characters are decoded as 0xdc00 + byte.

$ grep -rnwi 0xdc00 .
./ChangeLog:12625:      invalid multibyte characters to 0xDC00 + index which is invalid
./Src/pattern.c:242:    ((wchar_t) (0xDC00 + STOUC(ch)))

See workers/36411 workers/36415

It would be great if something like that was done everywhere so
we can always deal with arbitrary arrays of bytes regardless of
the locale.

> Also interesting to consider is the range \x7f to \x9f in an ISO-8859-x
> locale. Those are duplicates of the control characters. In my testing
> with a single-byte locale \x89 as a delimiter will end input at a tab
> character but the converse (\t as a delimiter) will not terminate at
> \x89 in the input.

I can't reproduce here:

~$ LC_ALL=en_GB.iso885915 zsh +o multibyte -c "IFS= read -rd $'\x89' a <<< $'a\tb'; print -rn -- \$a" | hd
00000000  61 09 62 0a                                       |a.b.|
00000004
~$ LC_ALL=en_GB.iso885915 zsh -c "IFS= read -rd $'\x89' a <<< $'a\tb'; print -rn -- \$a" | hd
00000000  61 09 62 0a                                       |a.b.|
00000004
~$ LC_ALL=en_GB.UTF-8 zsh -c "IFS= read -rd $'\x89' a <<< $'a\tb'; print -rn -- \$a" | hd
00000000  61 09 62 0a                                       |a.b.|
00000004
~$ LC_ALL=en_GB.UTF-8 zsh +o multibyte -c "IFS= read -rd $'\x89' a <<< $'a\tb'; print -rn -- \$a" | hd
00000000  61 09 62 0a                                       |a.b.|
00000004

> My understanding of the proposed POSIX wording is that it requires
> the individual octet, regardless of any character mapping to be the
> delimiter. Does anyone track the austin list? Would be good if they can
> be persuaded to relax what they specify. The part I especially object to
> is requiring that the input does not contain null bytes. The fact that
> zsh can cope with nulls is often really useful. Why can't they leave
> that unspecified? I can understand wanting to standardise a lowest
> common denominator but that is punishing an existing richer
> implementation.

Not sure what you mean. The proposed text has:

    -d delim
        If delim consists of one single-byte character, that byte
        shall be used as the logical line delimiter. If delim is
        the null string, the logical line delimiter shall be the
        null byte. Otherwise, the behavior is unspecified.

That's added alongside xargs -0 and find's -print0 to be able to
deal with arbitrary file names, so the point is for it to work
on input with NULs.

The:

    If the -d delim option is specified and delim consists of one
    single-byte character other than <newline>, the standard input
    shall contain zero or more characters, shall not contain any
    null bytes, and (if not empty) shall end with delim.

is a requirement on the *application*, not the implementation.
That is, it only specifies what's meant to happen when the input
doesn't contain NULs. So I think we're good here.

I'm subscribed to both austin-group-l and zsh-workers but don't
follow them very closely. I try to mention things relevant to
zsh here when I spot them on austin-group-l and I try to argue
there about things that would conflict with the zsh way for no
good reason.

austin-group-l is not high volume; I would recommend at least
one zsh developer join it. I see the maintainers of FreeBSD sh,
NetBSD sh, mksh, bash at least occasionally contributing there.

You can also get an account on their bug tracker. I've got one
and I'm not the maintainer of any software relevant to POSIX.
Changes in the bug tracker are posted to the ML. It's often
preferable to add a comment on a ticket than to post on the ML.

> One way forward would be to take the argument to -d as a literal and
> potentially multi-byte delimiter. UTF-8 has the property that a valid
> sequence can't occur within a longer sequence so for UTF-8 you would not
> need to worry about it finding a delimiter within a different
> character. This is not the case with combining characters but the
> current implementation will also stop at the uncombined character.
> There are other multi-byte encodings for which this is not true. I've
> no idea how relevant things like EUC-JP and Shift-JIS still are.

Things like Shift-JIS are unworkable. I don't expect anyone to
still be using them. GB18030 and BIG5/BIG5-HKSCS may still be
relevant. They don't work on Shift state like Shift-JIS, but
many of their multibyte characters contain bytes <= 0x7f, and
zsh doesn't really work with them for that reason.

$ echo αε | iconv -t BIG5-HKSCS | hd
00000000  a3 5c a3 60 0a                                    |.\.`.|
00000005

Simply *having* locales with those charsets opens your system up
to security vulnerabilities as you have alpha characters which
contain \ and `, special to the shell and many other things.
In practice, not many things work with them; it's not just zsh.
I've noticed newer Debian/Ubuntu doesn't offer locales with them
any longer (though you can still generate some if you like).

[...]
> Should we document the fact that -d '' works like -d $'\0'? Perhaps mark
> this as being for compatibility with other shells? Fortunately, it does
> work as specified but this may only be by accident. When the -d feature
> was added, it was probably only checked that the behaviour with an empty
> delimiter was sane.

Yes, I agree it's worth documenting. AFAIK, read -d is from
ksh93. read -d '' likely works in bash (added there in 2.04) by
accident as well (first byte of a NUL-delimited string); it
didn't in ksh93. IFS= read -rd '' is a well known coding pattern
in bash. read -d '' now works in ksh93u+m and mksh. -d is likely
used much more often with '' than with anything else.

> > $ LC_ALL=en_GB.iso885915 ./Src/zsh +o multibyte
> > $ locale charmap
> > ISO-8859-15
>
> What do you get with the following, I'd sooner trust this:
>   zmodload zsh/langinfo; echo $langinfo[CODESET]

Same: ISO-8859-15

$ LC_ALL=en_GB.iso885915 ./Src/zsh +o multibyte -c "IFS= read -rd $'\351' a <<< a$'\351'b; print -rn -- \$a" | hd
00000000  61 e9 62 0a                                       |a.b.|
00000004

gdb under LC_ALL=en_GB.iso885915 luit

6402        if (OPT_ISSET(ops,'d')) {
(gdb)
6403            char *delimstr = OPT_ARG(ops,'d');
(gdb)
6407            if (isset(MULTIBYTE)) {
(gdb)
6412                wi = WEOF;
(gdb)
6413            if (wi != WEOF)
(gdb)
6416                delim = (wchar_t)((delimstr[0] == Meta) ?
(gdb)
6417                                    delimstr[1] ^ 32 : delimstr[0]);
(gdb)
6416                delim = (wchar_t)((delimstr[0] == Meta) ?
(gdb)
6421            if (SHTTY != -1) {
(gdb) p delim
$1 = -23 L'\xffffffe9'
(gdb) p delimstr
$2 = 0x7ffff7fa1790 "é"

(as delimstr is a signed char * instead of an unsigned char *, I
guess).
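
To illustrate the sign-extension guess above, here is a minimal C
sketch; it is not zsh's actual code. STOUC is assumed here to be a
plain cast to unsigned char, and delim_unpatched, delim_patched and
matches_input_byte are hypothetical helpers named for this example
only. signed char is spelled out so the result doesn't depend on
whether plain char is signed on the build platform, and the sketch
assumes input bytes are compared as values in 0..255.

```c
#include <assert.h>
#include <wchar.h>

/* Assumed stand-in for zsh's STOUC() macro: a cast to unsigned char. */
#define STOUC(X) ((unsigned char)(X))

/* Hypothetical helper: delimiter computed without the cast.
 * A signed char holding byte 0xe9 has the value -23, and that
 * negative value is what gets converted to wchar_t. */
static long delim_unpatched(const signed char *delimstr)
{
    wchar_t delim = (wchar_t)delimstr[0];
    return (long)delim;
}

/* Hypothetical helper: delimiter computed with the cast applied
 * first, so the byte is taken as a value in 0..255. */
static long delim_patched(const signed char *delimstr)
{
    wchar_t delim = (wchar_t)STOUC(delimstr[0]);
    assert(delim >= 0 && delim <= 0xff);
    return (long)delim;
}

/* Assumption: the reader compares the delimiter against input
 * bytes taken as unsigned values. */
static int matches_input_byte(long delim, unsigned char input_byte)
{
    return delim == (long)input_byte;
}
```

With byte 0xe9 stored as the signed char value -23,
delim_unpatched() yields something other than 0xe9 (on glibc/x86-64,
where wchar_t is a signed 32-bit int, it is the -23 that gdb
printed), so it can never match an input byte; delim_patched()
yields 0xe9 and does match.
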

It works better after:

diff --git a/Src/builtin.c b/Src/builtin.c
index a7b7755a7..d650ca750 100644
--- a/Src/builtin.c
+++ b/Src/builtin.c
@@ -6414,9 +6414,9 @@ bin_read(char *name, char **args, Options ops, UNUSED(int func))
                 delim = (wchar_t)wi;
             else
                 delim = (wchar_t)((delimstr[0] == Meta) ?
-                                    delimstr[1] ^ 32 : delimstr[0]);
+                                    STOUC(delimstr[1]) ^ 32 : STOUC(delimstr[0]));
 #else
-            delim = (delimstr[0] == Meta) ? delimstr[1] ^ 32 : delimstr[0];
+            delim = (delimstr[0] == Meta) ? STOUC(delimstr[1]) ^ 32 : STOUC(delimstr[0]);
 #endif
         if (SHTTY != -1) {
             struct ttyinfo ti;

(I don't know if it's the proper way to cast, my C is rusty)

Including for bytes that don't map to any character in
ISO8859-15:

$ LC_ALL=en_GB.iso885915 zsh -c "IFS= read -rd $'\x80' a <<< $'a\x80b'; print -rn -- \$a" | hd
00000000  61                                                |a|
00000001

So I guess that's the fix for my bug.

-- 
Stephane