From: erik quanstrom <quanstro@quanstro.net>
To: 9fans@cse.psu.edu
Subject: Re: [9fans] simplicity
Date: Wed, 10 Oct 2007 08:22:42 -0400
Message-ID: <d805c5d583d15a45010664ab8e838326@quanstro.net>
In-Reply-To: <6e35c0620710092317x5d704601k70090c8ed2cbff0e@mail.gmail.com>
> I was thinking of the simplistic scenario, where someone might be
> looking for niño in some file, regardless of what locale they might
> happen to be in. Now I can imagine the nightmare it must be for
> non-English speakers looking for letter combinations irrespective of
> accents.
>
> But, it seems more like a problem with the shorthand than grep, per
> se.
i agree with this. or it's a historical problem with the character set.
clearly if you were designing a universal character set with no compatibility
constraints, the alphabet would put ñ right next to n, so [a-z] would
match both.
> I could see an argument for [:alpha:] potentially matching n and
> ñ depending on the locale, but [a-z] not matching ñ in any locale. But
> even that, my tendency would be that [:alpha:] match ñ in every
> locale.
>
> But then, does [:alpha:] match ἄγαθος? How ironic that it doesn't match α.
i don't think one can go this route. you can't have a magic environment
variable that changes everything. testing is a nightmare in such a world.
you have to go through every combination of (data character set, locale)
to see if things are working.
a better solution is to use the properties of unicode. ñ is noted in the
table as
00f1;latin small letter n with tilde;ll;0;l;006e 0303;;;;n;latin small letter n tilde;;00d1;;00d1
field 6 has the base codepoint 006e as its first subfield. it would not be hard
to build a table, call it σ, that quickly maps a codepoint to its base codepoint.
but it would probably be most useful to also have a mapping, call it ξ, from
each base codepoint to all of its composed forms.
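a minimal sketch of building both tables, in python rather than rc. the
function name build_tables and the three extra sample lines are assumptions
made for illustration; only the ñ line comes from the table quoted above.
it looks only at the first subfield of field 6, skips decompositions that
carry a <tag>, and does not chase a base that itself decomposes.

```python
# sample lines in the UnicodeData.txt layout. the first is the ñ entry
# quoted above; the others are illustrative additions.
SAMPLE = """\
00f1;latin small letter n with tilde;ll;0;l;006e 0303;;;;n;latin small letter n tilde;;00d1;;00d1
00e4;latin small letter a with diaeresis;ll;0;l;0061 0308;;;;n;latin small letter a diaeresis;;00c4;;00c4
00e1;latin small letter a with acute;ll;0;l;0061 0301;;;;n;latin small letter a acute;;00c1;;00c1
0061;latin small letter a;ll;0;l;;;;;n;;;0041;;0041
"""

def build_tables(lines):
    """return (sigma, xi): sigma maps a composed character to its base,
    xi maps a base to the set of composed characters built on it."""
    sigma = {}
    xi = {}
    for line in lines:
        fields = line.strip().split(";")
        if len(fields) < 6 or not fields[5]:
            continue            # no decomposition: nothing to record
        if fields[5].startswith("<"):
            continue            # tagged (compatibility) decomposition: skipped here
        c = chr(int(fields[0], 16))
        base = chr(int(fields[5].split()[0], 16))   # first subfield of field 6
        sigma[c] = base
        xi.setdefault(base, set()).add(c)
    return sigma, xi

sigma, xi = build_tables(SAMPLE.splitlines())
```

with this sample, sigma maps ñ to n, and xi maps a to the set holding ä and á.
a plain a has an empty field 6, so it gets no entry in sigma.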
suppose, for lack of creativity, we use » to mean "match any character
whose base codepoint matches the next item", so »a matches ä, as does »[a-z].
then » applied to a letter c can be grepped for by taking ξ(σ(c)), which
results in a character class.
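one way the » expansion might look, again as a python sketch. it leans on
python's unicodedata module instead of generated tables; the names sigma,
XI and expand are hypothetical, and the scan stops at codepoint 0x250 only
to keep the table small. here σ follows the first subfield repeatedly until
it reaches a character with no canonical decomposition, and the base itself
is kept in the resulting class so that »a still matches a plain a.

```python
import re
import unicodedata

def sigma(c):
    """follow the first codepoint of the canonical decomposition
    until a character with no canonical decomposition is reached."""
    while True:
        d = unicodedata.decomposition(c)
        if not d or d.startswith("<"):
            return c            # nothing left to strip, or a tagged mapping
        c = chr(int(d.split()[0], 16))

# ξ for the latin blocks only (an arbitrary cutoff for this sketch):
# base character -> every character that reduces to it, itself included
XI = {}
for cp in range(0x0250):
    ch = chr(cp)
    XI.setdefault(sigma(ch), set()).add(ch)

def expand(chars):
    """» applied to a set of characters: a regexp class holding each
    character's base plus all forms composed on that base."""
    members = set()
    for c in chars:
        members |= XI.get(sigma(c), {c})
    return "[" + "".join(re.escape(m) for m in sorted(members)) + "]"

# »n inside a larger pattern, and »[a-z] as a single class
nino = re.compile("ni" + expand("n") + "o")
az = re.compile(expand("abcdefghijklmnopqrstuvwxyz"))

print(bool(nino.search("el ni\u00f1o")))    # True
print(bool(nino.search("el nino")))         # True
print(bool(az.fullmatch("\u00e4")))         # True
```

this only sees precomposed text. input that spells ñ as n followed by a
combining tilde would need normalizing first, which the sketch does not do.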
plan 9 already has some of this in the c library with tolowerrune, etc.
i did some work with this some time ago and wrote some rc scripts to
generate the to*rune tables from the unicode standard data. it would
be easy to adapt them to generate ξ and σ. (the tables would be pretty big.)
>
> What an ugly problem.
it can be made ugly quickly. but i'm not convinced that all approaches
to this problem are bad.
- erik