Say I have these files:

0 [2021-08-05--07:58] d,1
0 [2021-08-05--07:59] d,1,first
0 [2021-08-05--07:59] d,1,second

I want a glob that captures them all but:

$ print -l d,1[^[:digit:]]*(N)
d,1,first
d,1,second

... I need a second go:

$ print -l d,1
d,1

I'm expecting that no character after the '1' in 'd,1' is 'not a digit', so the first filter should capture 'd,1', but as we see, the second filter is needed. It seems that the first filter should read: "not a digit but not nothing either". Can the first filter be improved to grab them all?
> On 05 August 2021 at 16:14 Ray Andrews <rayandrews@eastlink.ca> wrote:
> Say I have these files:
>
> 0 [2021-08-05--07:58] d,1
> 0 [2021-08-05--07:59] d,1,first
> 0 [2021-08-05--07:59] d,1,second
>
> I want a glob that captures them all but:
>
> $ print -l d,1[^[:digit:]]*(N)
> d,1,first
> d,1,second
Yes, indeed it looks like this is available by default (you don't even need EXTENDED_GLOB).
d,1([^[:digit:]]*|)(N)
The (either_this|or_that) matches either what you originally said or nothing ("or_that" here is empty).
There are other ways of doing it, and there's a ksh/bash-compatible way, too, but that's probably
the one I'd immediately reach for.
pws
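The original pattern skips the bare name because [^[:digit:]] has to consume exactly one character; the empty alternative removes that requirement. As a rough sketch of the same logic in a form that sh, bash and zsh all accept, here it is written as case patterns rather than as the zsh glob above. The function names are made up for illustration, and "d,12" is an extra sample name that is not in the thread.

```shell
# Hypothetical helpers, not from the thread: each tests one name against
# a pattern using case, where "a|b" means "a or b".

# Original pattern: "d,1", then one non-digit character, then anything.
matches_strict() {
    case $1 in
        d,1[![:digit:]]*) return 0 ;;
        *) return 1 ;;
    esac
}

# Fixed logic: either the bare "d,1" or the original pattern.
matches_loose() {
    case $1 in
        d,1|d,1[![:digit:]]*) return 0 ;;
        *) return 1 ;;
    esac
}

for name in d,1 d,1,first d,1,second d,12; do
    matches_strict "$name" && strict=yes || strict=no
    matches_loose "$name" && loose=yes || loose=no
    echo "$name strict=$strict loose=$loose"
done
```

Only the bare "d,1" differs between the two helpers: the strict one rejects it and the loose one accepts it, while "d,12" is rejected by both.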
> On 05 August 2021 at 16:27 Peter Stephenson <p.w.stephenson@ntlworld.com> wrote:
> > On 05 August 2021 at 16:14 Ray Andrews <rayandrews@eastlink.ca> wrote:
> > Say I have these files:
> >
> > 0 [2021-08-05--07:58] d,1
> > 0 [2021-08-05--07:59] d,1,first
> > 0 [2021-08-05--07:59] d,1,second
> >
> > I want a glob that captures them all but:
> >
> > $ print -l d,1[^[:digit:]]*(N)
> > d,1,first
> > d,1,second
>
> Yes, indeed it looks like this is available by default (you don't even need EXTENDED_GLOB).
>
> d,1([^[:digit:]]*|)(N)
>
> The (either_this|or_that) matches either what you originally said or nothing ("or_that" here is empty).
>
> There are other ways of doing it, and there's a ksh/bash-compatible way, too, but that's probably
> the one I'd immediately reach for.
Actually, come to think of it, I can't resist mentioning one other, since it keeps the squiggles
to a bare minimum, though this *does* require "setopt extendedglob".
d,1^[[:digit:]]*
That means "d,1" followed by anything that doesn't match the pattern after the "^". Caution
needs to be exercised with more complicated uses of "^", though --- negative match assertions
can be really counterintuitive.
pws
On 2021-08-05 8:27 a.m., Peter Stephenson wrote:
>
> d,1([^[:digit:]]*|)(N)
>
> The (either_this|or_that) matches either what you originally said or nothing ("or_that" here is empty).
>
>
Marvelous, I know the logic of these regex things is very rigorous, so
the 'not digit BUT still something' construction must be handled and of
course there will be a way of saying 'not digit but maybe nothing' too.
Thanks Peter. I tried to get the '|' working myself but didn't get it
quite right.
On 2021-08-05 8:36 a.m., Peter Stephenson wrote:
>> On 05 August 2021 at 16:27 Peter Stephenson <p.w.stephenson@ntlworld.com> wrote:
>>
> d,1^[[:digit:]]*
>
>
Cool. I thought zsh always used ' [:digit:]' not the double bracket
form. But at least the above is easy to parse by eye, the caret is
intuitive there. Perhaps we could interpret '[^[:digit:]]' as saying
that since the outermost bracket is outside the caret, there must be
something rather than nothing. Or, if discussing cosmology:
[^[universe]]
; - )
On 2021-08-05 09:05:19 -0700, Ray Andrews wrote:
> Cool. I thought zsh always used ' [:digit:]' not the double bracket form.

It is not a double-bracket form. The two brackets have different
meanings: the first one is for a list of characters, and the second
one is for a character class. You can write: [ab[:digit:]cd]

--
Vincent Lefèvre <vincent@vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)
On Thu, Aug 5, 2021, at 12:05 PM, Ray Andrews wrote:
> On 2021-08-05 8:36 a.m., Peter Stephenson wrote:
> >> On 05 August 2021 at 16:27 Peter Stephenson <p.w.stephenson@ntlworld.com> wrote:
> >>
> > d,1^[[:digit:]]*
> >
> >
> Cool. I thought zsh always used ' [:digit:]' not the double bracket
> form. But at least the above is easy to parse by eye, the caret is
> intuitive there. Perhaps we could interpret '[^[:digit:]]' as saying
> that since the outermost bracket is outside the caret, there must be
> something rather than nothing.
[^[:digit:]] already means "a character that is not a digit".
--
vq
On 2021-08-05 9:48 a.m., Vincent Lefevre wrote:
> On 2021-08-05 09:05:19 -0700, Ray Andrews wrote:
>> Cool. I thought zsh always used ' [:digit:]' not the double bracket form.
> It is not a double-bracket form. The two brackets have different
> meanings: the first one if for a list of characters, and the second
> one is for a character class. You can write: [ab[:digit:]cd]
>
Thanks. Easy to get that wrong. So it's ' [: ' that should be taken as
a single entity and ':] ' likewise.
On Thu, Aug 5, 2021 at 11:22 AM Ray Andrews <rayandrews@eastlink.ca> wrote:
>
> So it's ' [: ' that should be taken as
> a single entity and ':] ' likewise.
Uh, no. Inside a [...] character class, the entire sequence [:word:]
is a single entity, for certain values of word. Outside of a
character class, [: and :] have no special meaning.
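A small sketch of that point: inside the outer brackets, [:digit:] is just one item of the set, sitting alongside the literal characters. The function name is hypothetical, and a case statement is assumed here so that the snippet also runs in sh and bash.

```shell
# Hypothetical helper: succeeds if $1 is a single character that is
# a, b, c, d, or anything the current locale classifies as a digit.
in_set() {
    case $1 in
        [ab[:digit:]cd]) return 0 ;;
        *) return 1 ;;
    esac
}

for ch in a 7 d x; do
    if in_set "$ch"; then
        echo "$ch: in set"
    else
        echo "$ch: not in set"
    fi
done
```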
2021-08-05 09:05:19 -0700, Ray Andrews:
> On 2021-08-05 8:36 a.m., Peter Stephenson wrote:
> > > On 05 August 2021 at 16:27 Peter Stephenson <p.w.stephenson@ntlworld.com> wrote:
> > >
> > d,1^[[:digit:]]*
> >
> >
> Cool. I thought zsh always used ' [:digit:]' not the double bracket form.
[...]

I would avoid [[:digit:]] in zsh globs / patterns, especially for input validation.

[:digit:] within a bracket expression is a POSIX character class; it is a POSIX invention. It is recognised, but only within bracket expressions, by anything specified by POSIX that uses shell filename patterns or regular expressions (basic or extended), such as sh (for globs or case constructs), find (for -name/-path matching), or grep/sed/ed...

[X[:digit:]] matches any character that is either X or classified as a decimal digit in the locale. What that matches in practice depends on the system and locale.

In 2016, someone pointed out to POSIX that isdigit() in the C standard is not locale dependent and matches on 0123456789 only (https://www.austingroupbugs.net/view.php?id=1078), so, to align with that, future versions of the standard will restrict [:digit:] to match on 0123456789 only and will forbid it from matching any other decimal digits. I wouldn't be surprised if that's later reverted again though, as it's quite unintuitive / inconsistent.

Still, there are systems where iswdigit() matches on a lot more than 0123456789 in some locales, and as a consequence, the [[:digit:]] of zsh globs and most other tools will too.

For instance, on FreeBSD 12.2 and in an en_US.UTF-8 locale, [[:digit:]] matches on

0123456789٠١٢٣٤٥٦٧٨٩۰۱۲۳۴۵۶۷۸۹߀߁߂߃߄߅߆߇߈߉०१२३४५६७८९০১২৩৪৫৬৭৮৯੦੧੨੩੪੫੬੭੮੯૦૧૨૩૪૫૬૭૮૯୦୧୨୩୪୫୬୭୮୯௦௧௨௩௪௫௬௭௮௯౦౧౨౩౪౫౬౭౮౯೦೧೨೩೪೫೬೭೮೯൦൧൨൩൪൫൬൭൮൯෦෧෨෩෪෫෬෭෮෯๐๑๒๓๔๕๖๗๘๙໐໑໒໓໔໕໖໗໘໙༠༡༢༣༤༥༦༧༨༩၀၁၂၃၄၅၆၇၈၉႐႑႒႓႔႕႖႗႘႙០១២៣៤៥៦៧៨៩᠐᠑᠒᠓᠔᠕᠖᠗᠘᠙᥆᥇᥈᥉᥊᥋᥌᥍᥎᥏᧐᧑᧒᧓᧔᧕᧖᧗᧘᧙᪀᪁᪂᪃᪄᪅᪆᪇᪈᪉᪐᪑᪒᪓᪔᪕᪖᪗᪘᪙᭐᭑᭒᭓᭔᭕᭖᭗᭘᭙᮰᮱᮲᮳᮴᮵᮶᮷᮸᮹᱀᱁᱂᱃᱄᱅᱆᱇᱈᱉᱐᱑᱒᱓᱔᱕᱖᱗᱘᱙꘠꘡꘢꘣꘤꘥꘦꘧꘨꘩꣐꣑꣒꣓꣔꣕꣖꣗꣘꣙꤀꤁꤂꤃꤄꤅꤆꤇꤈꤉꧐꧑꧒꧓꧔꧕꧖꧗꧘꧙꧰꧱꧲꧳꧴꧵꧶꧷꧸꧹꩐꩑꩒꩓꩔꩕꩖꩗꩘꩙꯰꯱꯲꯳꯴꯵꯶꯷꯸꯹0123456789𐒠𐒡𐒢𐒣𐒤𐒥𐒦𐒧𐒨𐒩𐴰𐴱𐴲𐴳𐴴𐴵𐴶𐴷𐴸𐴹𑁦𑁧𑁨𑁩𑁪𑁫𑁬𑁭𑁮𑁯𑃰𑃱𑃲𑃳𑃴𑃵𑃶𑃷𑃸𑃹𑄶𑄷𑄸𑄹𑄺𑄻𑄼𑄽𑄾𑄿𑇐𑇑𑇒𑇓𑇔𑇕𑇖𑇗𑇘𑇙𑋰𑋱𑋲𑋳𑋴𑋵𑋶𑋷𑋸𑋹𑑐𑑑𑑒𑑓𑑔𑑕𑑖𑑗𑑘𑑙𑓐𑓑𑓒𑓓𑓔𑓕𑓖𑓗𑓘𑓙𑙐𑙑𑙒𑙓𑙔𑙕𑙖𑙗𑙘𑙙𑛀𑛁𑛂𑛃𑛄𑛅𑛆𑛇𑛈𑛉𑜰𑜱𑜲𑜳𑜴𑜵𑜶𑜷𑜸𑜹𑣠𑣡𑣢𑣣𑣤𑣥𑣦𑣧𑣨𑣩𑱐𑱑𑱒𑱓𑱔𑱕𑱖𑱗𑱘𑱙𑵐𑵑𑵒𑵓𑵔𑵕𑵖𑵗𑵘𑵙𑶠𑶡𑶢𑶣𑶤𑶥𑶦𑶧𑶨𑶩𖩠𖩡𖩢𖩣𖩤𖩥𖩦𖩧𖩨𖩩𖭐𖭑𖭒𖭓𖭔𖭕𖭖𖭗𖭘𖭙𝟎𝟏𝟐𝟑𝟒𝟓𝟔𝟕𝟖𝟗𝟘𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡𝟢𝟣𝟤𝟥𝟦𝟧𝟨𝟩𝟪𝟫𝟬𝟭𝟮𝟯𝟰𝟱𝟲𝟳𝟴𝟵𝟶𝟷𝟸𝟹𝟺𝟻𝟼𝟽𝟾𝟿𞥐𞥑𞥒𞥓𞥔𞥕𞥖𞥗𞥘𞥙

All decimal digits, some variations on the 0123456789 Arabic ones, and some other decimal digits in some other scripts.

[0-9] itself, in general, is even worse. No two systems or utilities or library functions and versions thereof agree on what characters are ranked between 0 and 9. It could even match on sequences of characters (collating elements).

That's not the case for zsh globs though, where [0-9] only matches on 0123456789, as ranges in zsh are based on the wide char value of the characters (or byte value if the multibyte option is off), and for those 0123456789 characters specifically, in practice, the wide char values are consecutive and in that order regardless of the locale and system.

Beware though that it only applies to zsh globs. It doesn't apply to [0-9] in regexps, which use the system's extended regexp matching functions (or pcre with the rematchpcre option; see also \d there).

The only thing guaranteed to match only 0123456789 regardless of locale and system is [0123456789]; do not use [[:digit:]] or \d for that. In zsh, you can use [0-9] but only with globs.

[[ $d = [0-9] ]] && echo is one of 0123456789

is correct (in zsh, not in bash / ksh93)

[[ $d =~ '^[0-9]$' ]] && echo is one of 0123456789

is not (at least on some systems/locales).

With set -o rematchpcre

[[ $d =~ '^[0-9]\Z' ]] && echo is one of 0123456789

should be OK (so would the same with \d, though I wouldn't trust it, as it could vary with the version and with what flags are passed to the matcher, since \d can be told to match other digits under some circumstances).

Also beware that regexp matching doesn't work properly on non-text.

See also https://www.mail-archive.com/bug-bash@gnu.org/msg25885.html for a glimpse at the (messier) situation in the bash shell.

--
Stephane
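As an illustration of the advice above, a validation helper can spell the ten characters out, so the result does not depend on the locale or on how ranges are ordered. This is only a sketch: the function name is made up, and a case statement is assumed so that the same snippet behaves the same way in sh, bash and zsh.

```shell
# Hypothetical helper: succeeds only for a non-empty string made up
# solely of the ASCII characters 0123456789.
is_ascii_number() {
    case $1 in
        '' | *[!0123456789]*) return 1 ;;   # empty, or contains anything else
        *) return 0 ;;
    esac
}

# The last sample is written with Arabic-Indic digits and is rejected.
for value in 42 007 4x2 '' '٤٢'; do
    if is_ascii_number "$value"; then
        echo "'$value': accepted"
    else
        echo "'$value': rejected"
    fi
done
```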
On 2021-08-08 8:07 a.m., Stephane Chazelas wrote:
> 0123456789٠١٢٣٤٥٦٧٨٩۰۱۲۳۴۵۶۷۸۹ [...]

Yikes! A sad tale, seems the situation is anarchy. The multiculturalists have overreached themselves. Arabic numerals only please. I understand that's the rule in China.

> That's not the case of zsh globs though where [0-9] only matches
> on 0123456789,

Good to know. Actually I've already abandoned [:digit:] for no other reason than '[0-9]' seems more native to zsh which, as you show, is quite true. Thanks, that was a most detailed horror story on the state of things.
On Sun, Aug 8, 2021, at 12:19 PM, Ray Andrews wrote:
> Yikes! A sad tale, seems the situation is anarchy. The
> multiculturalists have overreached themselves.
"Anarchy" does not mean "anything I don't expect, like, or understand".
--
vq
On 2021-08-08 09:19:13 -0700, Ray Andrews wrote:
> On 2021-08-08 8:07 a.m., Stephane Chazelas wrote:
> > That's not the case of zsh globs though where [0-9] only matches
> > on 0123456789,
> Good to know. Actually I've already abandoned [:digit:] for no other reason
> than '[0-9]' seems more native to zsh which, as you show, is quite true.

I've always used [0-9], mainly because it is shorter to type than [[:digit:]].

Just in case, I have the following in my .zshenv file:

export LC_COLLATE=POSIX

(actually, mainly for better filename sorting when filenames contain hyphen-minus characters). IMHO, it is good to have that by default. Specific collation rules should be used only when the context for which they are considered is known.

--
Vincent Lefèvre <vincent@vinc17.net>
On 2021-08-08 3:24 p.m., Vincent Lefevre wrote:
>
> Just in case, I have the following in my .zshenv file:
>
> export LC_COLLATE=POSIX
Wheels within wheels within wheels. Plain country boy like me just
wants something predictable.
One day I might need these:
١٢٣٤٥٦٧٨٩۰۱۲۳۴۵۶۷۸۹߀߁߂߃߄߅߆߇߈߉०
... but not now.
2021-08-09 00:24:55 +0200, Vincent Lefevre:
[...]
> Just in case, I have the following in my .zshenv file:
>
> export LC_COLLATE=POSIX
>
> (actually, mainly for better filename sorting when filenames contain
> hyphen-minus characters). IMHO, it is good to have that by default.
> Specific collation rules should be used only when the context for
> which they are considered is known.
[...]
I wonder how reliable / portable it is to have LC_COLLATE
different from LC_CTYPE, especially when using ranges whose ends
are not in the POSIX locale charset (though I don't think I've
ever used such ranges; but users with languages using other
scripts (Greek, Cyrillic...) might).
In any case, note that having that in *your own* .zshenv doesn't
address the potential problems in the scripts you write that
may be used by others.
It should also be noted that LC_ALL takes precedence over
LC_COLLATE.
--
Stephane
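The precedence mentioned above can be sketched as a lookup: LC_ALL if it is set and non-empty, otherwise LC_COLLATE, otherwise LANG, otherwise the POSIX default. The helper name is hypothetical, and it only reports which setting would win; it does not check that the named locale is installed.

```shell
# Hypothetical helper: print the locale name that governs collation,
# following the order LC_ALL, then LC_COLLATE, then LANG.
# An empty value counts as unset, which ${var:-fallback} handles.
effective_collate() {
    echo "${LC_ALL:-${LC_COLLATE:-${LANG:-POSIX}}}"
}

# With LC_ALL unset, the LC_COLLATE setting is the one that applies.
( unset LC_ALL; LC_COLLATE=POSIX; LANG=C; effective_collate )

# Once LC_ALL is set, the LC_COLLATE setting is ignored.
( LC_ALL=C; LC_COLLATE=POSIX; effective_collate )
```

A script meant for other people's environments would therefore have to set LC_ALL itself (or clear it) rather than rely on LC_COLLATE alone.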