From: Oliver Kiddle
To: Zsh hackers list
Subject: Re: read -d $'\200' doesn't work with set +o multibyte
In-reply-to: <20221209154225.2z3lbtf422ypnmjx@chazelas.org>
References: <20221209154225.2z3lbtf422ypnmjx@chazelas.org>
Date: Fri, 09 Dec 2022 21:05:02 +0100
Message-ID: <99492-1670616302.663548@1brw.o7tP.wgJL>
X-Seq: 51159

Stephane Chazelas wrote:
> Even in a locale with a single-byte charmap, when multibyte is
> off, I can't make read -d work when the delimiter is a byte >=
> 0x80.

In my testing, it does work in a single-byte locale. I tested on
multiple systems.

Looking at the multibyte implementation of read, the approach taken is
to use a wchar_t for the delimiter and then maintain mbstate_t for the
input. This supports a delimiter that can be any single Unicode
codepoint. In my testing this is working as intended. But note that
\351 alone is incomplete in UTF-8 terms, so what wchar_t value should
that be mapped to?

Also interesting to consider is the range \x7f to \x9f in an ISO-8859-x
locale. Those are duplicates of the control characters. In my testing
with a single-byte locale, \x89 as a delimiter will end input at a tab
character, but the converse (\t as a delimiter) will not terminate at
\x89 in the input.

My understanding of the proposed POSIX wording is that it requires the
individual octet, regardless of any character mapping, to be the
delimiter. Does anyone track the austin list? It would be good if they
could be persuaded to relax what they specify. The part I especially
object to is requiring that the input does not contain null bytes. The
fact that zsh can cope with nulls is often really useful. Why can't they
leave that unspecified? I can understand wanting to standardise a lowest
common denominator, but that is punishing an existing richer
implementation.

One way forward would be to take the argument to -d as a literal and
potentially multi-byte delimiter.
UTF-8 has the property that a valid sequence can't occur within a longer
sequence, so for UTF-8 you would not need to worry about it finding a
delimiter within a different character. This is not the case with
combining characters, but the current implementation will also stop at
the uncombined character. There are other multi-byte encodings for which
this is not true. I've no idea how relevant things like EUC-JP and
Shift-JIS still are. A side effect of this would be support for strings
of quite distinct characters as a multi-character delimiter.

Should we document the fact that -d '' works like -d $'\0'? Perhaps mark
this as being for compatibility with other shells? Fortunately, it does
work as specified, but this may only be by accident. When the -d feature
was added, it was probably only checked that the behaviour with an empty
delimiter was sane.

> $ LC_ALL=en_GB.iso885915 ./Src/zsh +o multibyte
> $ locale charmap
> ISO-8859-15

What do you get with the following? I'd sooner trust this:

  zmodload zsh/langinfo; echo $langinfo[CODESET]

Oliver
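
PS: To illustrate the point above about matching a literal delimiter at the byte level, here is a minimal sketch in Python. It is not zsh's implementation; find_literal_delimiter is a hypothetical helper invented for the example, and the sample characters are my own choice. It suggests that a plain byte search is safe in UTF-8 but can match inside another character in Shift-JIS.

```python
# Illustrative only: find_literal_delimiter is a hypothetical helper,
# not part of zsh or of any proposed POSIX interface.
def find_literal_delimiter(data: bytes, delim: bytes) -> int:
    """Return the byte offset of the first literal match of delim, or -1."""
    return data.find(delim)

BACKSLASH = b"\\"   # the single byte 0x5C, used here as the delimiter
KANJI = "\u8868"    # U+8868, a character that is not a backslash

# UTF-8: every byte of a multi-byte character is >= 0x80, so an ASCII
# delimiter cannot match inside it.
utf8_hit = find_literal_delimiter(KANJI.encode("utf-8"), BACKSLASH)

# Shift-JIS: U+8868 encodes as 0x95 0x5C, so its trail byte is a false match.
sjis_hit = find_literal_delimiter(KANJI.encode("shift_jis"), BACKSLASH)

# A multi-byte UTF-8 delimiter (e-acute, 0xC3 0xA9) is likewise not found
# inside other characters that happen to share one of its bytes.
E_ACUTE = "\u00e9".encode("utf-8")
other = "\u00c3\u00a9".encode("utf-8")   # bytes 0xC3 0x83 0xC2 0xA9
multi_hit = find_literal_delimiter(other, E_ACUTE)

print(utf8_hit, sjis_hit, multi_hit)   # -1 1 -1
```

The Shift-JIS result is the case that a literal multi-byte -d argument would have to guard against, for example by tracking character boundaries in the input instead of searching raw bytes.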