List-Id: Zsh Workers List
X-Seq: 42773
Date: Mon, 14 May 2018 13:34:25 +0100
From: Stephane Chazelas
To: Peter Stephenson
Cc: Zsh hackers list
Subject: Re: [PATCH] [[:blank:]] only matches on SPC and TAB
Message-ID: <20180514123425.GA19631@chaz.gmail.com>
References: <20180513212553.GA29028@chaz.gmail.com>
 <20180514063611.GA7263@chaz.gmail.com>
 <20180514064431.GB7263@chaz.gmail.com>
 <20180514094733.308bff1a@camnpupstephen.cam.scsc.local>
In-Reply-To: <20180514094733.308bff1a@camnpupstephen.cam.scsc.local>
User-Agent: Mutt/1.5.24 (2015-08-30)

2018-05-14 09:47:33 +0100, Peter Stephenson:
> On Mon, 14 May 2018 07:44:31 +0100
> Stephane Chazelas wrote:
> > Tue Oct 13 21:42:47 1998  Andrew Main
> >
> >     * Doc/Zsh/expn.yo, Src/glob.c: Add the [:blank:] character
> >     class required by POSIX, which has no corresponding ctype macro.
> >
> > Which explains why it's not using isblank() and strongly
> > suggests that it was not intentional.
>
> I think that's correct, but I tend to agree with Sebastian that some
> caution is required here since it's not necessarily clear what action
> with non-ASCII spaces is actually wanted when this is used.
> I'd be surprised if it actually broke anything, though.
[...]

I was going to say that surely, when someone uses [:blank:], that
means they want to trust the locale on the definition of "blank",
and I can't see why that should be different from other character
classes. But I just noticed that the documentation actually says:

    [:blank:]
        The character is either space or tab

instead of "horizontal whitespace".

And on GNU systems, "isblank(3)" also says it's SPC and TAB:

    Returns true if C is a blank character; that is, a space or a tab.
    This function was originally a GNU extension, but was added in ISO C99.

while iswblank(3) is careful to refer to the locale's classification.

In practice, the only system where I could find a locale with a
single-byte charset with "blank" characters other than SPC and
TAB was NetBSD. There, isblank(0xa0) after setlocale() in a
locale that uses ISO8859-1, for instance, does return true (as POSIX
requires if that's how 0xa0 is classified in the locale). However, in
the same locale, its sh (which is not multibyte aware) outputs "no"
in:

    case $nbsb in
      [[:blank:][:space:]]) echo yes;;
      *) echo no
    esac

(bash outputs yes for both blank and space, as POSIX requires).

I don't think many people complained when multi-byte support was
added and English people started to have their [[:alpha:]] match
on Greek or Korean letters in addition to English ones (fair
enough, as "alpha" means the first letter of the Greek alphabet).

The main problem, if we want to align with other shells and make
the shell POSIX compliant, is that the documentation currently
states explicitly that it matches on space and tab only. The
question is: would any script be broken if we changed it?
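For anyone who wants to repeat the comparison, here is a rough sketch
of the case test above wrapped in a function so that the same
character can be tried under several locales. The name in_class and
the sample values are made up for illustration. Only the C locale
results are shown, since what a given shell reports for 0xa0 in an
ISO8859-1 or UTF-8 locale depends on the shell and the system.

```shell
# in_class LOCALE CHAR: print "yes" if CHAR is matched by
# [[:blank:][:space:]] when the pattern is evaluated under LOCALE,
# "no" otherwise. Hypothetical helper, for illustration only.
in_class() {
  (
    # Subshell, so the caller's locale is left untouched.
    LC_ALL=$1
    export LC_ALL
    case $2 in
      [[:blank:][:space:]]) echo yes ;;
      *) echo no ;;
    esac
  )
}

tab=$(printf '\t')
in_class C ' '      # yes (space)
in_class C "$tab"   # yes (tab)
in_class C a        # no  (ordinary letter)
```

Replacing C with a locale name from `locale -a` and passing a
no-break space as the second argument gives the per-shell answers
discussed above.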

People still keep using [a-z] when they mean to match English
lower case letters, while in effect nowadays (except in zsh and a
very few other utilities that match ranges based on code points)
it matches hundreds more (like à, œ, ć, if not ch, fi...). I
wouldn't be surprised if people used [[:alnum:]] thinking it only
matches Latin letters without diacritics and Arabic decimal
digits. But then again, that still works more or less for them,
as they use it against text that only contains English data
anyway.

To me, the correct way to do a strict match against ASCII blanks
(or English letters, or ASCII punctuation) would be to use the C
locale.

-- 
Stephane
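
PS: to make the C locale suggestion concrete, a minimal sketch
follows. The function names are hypothetical, not anything that
exists in zsh. The idea is to pin LC_ALL=C for the match only, in
a subshell, so that [[:blank:]] and [a-z] mean the ASCII sets
whatever the caller's locale is.

```shell
# Succeed if $1 is exactly one ASCII blank (space or tab).
is_ascii_blank() {
  (
    LC_ALL=C
    export LC_ALL
    case $1 in
      [[:blank:]]) exit 0 ;;
      *) exit 1 ;;
    esac
  )
}

# Succeed if $1 is exactly one ASCII lower case letter.
is_ascii_lower() {
  (
    LC_ALL=C
    export LC_ALL
    case $1 in
      [a-z]) exit 0 ;;
      *) exit 1 ;;
    esac
  )
}

is_ascii_blank ' ' && echo "space: blank"
is_ascii_lower q && echo "q: lower"
is_ascii_lower Q || echo "Q: not lower"
```

The subshell costs a fork per call; a script that does many matches
could instead set LC_ALL=C once at the top, if it does not need the
user's locale for anything else.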