From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <d805c5d583d15a45010664ab8e838326@quanstro.net>
From: erik quanstrom <quanstro@quanstro.net>
Date: Wed, 10 Oct 2007 08:22:42 -0400
To: 9fans@cse.psu.edu
Subject: Re: [9fans] simplicity
In-Reply-To: <6e35c0620710092317x5d704601k70090c8ed2cbff0e@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Topicbox-Message-UUID: cdb1c72e-ead2-11e9-9d60-3106f5b1d025

> I was thinking of the simplistic scenario, where someone might be
> looking for niño in some file, regardless of what locale they might
> happen to be in.  Now I can imagine the nightmare it must be for
> non-English speakers looking for letter combinations irrespective of
> accents.
>
> But, it seems more like a problem with the shorthand than grep, per
> se.

i agree with this.  or it's a historical problem with the character set.
clearly if you were designing a universal character set with no compatibility
constraints, the alphabet would have n and ñ together so [a-z] would
match both.

> I could see an argument for [:alpha:] potentially matching n and
> ñ depending on the locale, but [a-z] not matching ñ in any locale. But
> even that, my tendency would be that [:alpha:] match ñ in every
> locale.
>
> But then, does [:alpha:] match ἄγαθος?  How ironic that it doesn't match α.

i don't think one can go this route.  you can't have a magic environment
variable that changes everything.  testing is a nightmare in such a world.
you have to go through every combination of (data character set, locale) to see if
things are working.

a better solution is to use the properties of unicode.  ñ is noted in the
table as

00f1;latin small letter n with tilde;ll;0;l;006e 0303;;;;n;latin small letter n tilde;;00d1;;00d1

field 6 has the base codepoint 006e as its first subfield.  it would not be hard
to quickly build a table σ mapping a codepoint to its base codepoint.
but it would probably be most useful to also have a mapping ξ from
base codepoints to all composed forms.
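
as a rough sketch of how σ and ξ might be built (in python rather than rc, and
leaning on python's unicodedata module instead of parsing the table directly;
the names base_from_record, base and build_tables are made up for illustration,
and skipping the <compat>-style mappings is an assumption):

```python
import sys
import unicodedata
from collections import defaultdict

# the record for U+00F1 quoted above, as it appears in the table
RECORD = ("00f1;latin small letter n with tilde;ll;0;l;006e 0303;;;;n;"
          "latin small letter n tilde;;00d1;;00d1")


def base_from_record(line):
    """Return (codepoint, first subfield of field 6) for one table line."""
    fields = line.split(";")
    cp = int(fields[0], 16)
    decomp = fields[5].split()
    # assumption: compatibility mappings such as "<compat> 0020" are skipped
    if not decomp or decomp[0].startswith("<"):
        return cp, cp
    return cp, int(decomp[0], 16)


def base(cp):
    """sigma: follow canonical decompositions down to the base codepoint."""
    d = unicodedata.decomposition(chr(cp)).split()
    if not d or d[0].startswith("<"):
        return cp
    return base(int(d[0], 16))


def build_tables():
    """Return (sigma, xi): cp -> base, and base -> set of composed forms."""
    sigma = {}
    xi = defaultdict(set)
    for cp in range(sys.maxunicode + 1):
        b = base(cp)
        if b != cp:
            sigma[cp] = b
            xi[b].add(cp)
    return sigma, xi
```

here base(0x00f1) gives 0x006e, and xi[0x006e] contains 0x00f1 along with the
other accented forms of n.  the recursion matters for characters like U+1F04,
which decomposes in two steps down to α (U+03B1).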

suppose, for lack of creativity, we use » to mean all codepoints whose base
codepoint matches that of the next item, so »a matches ä, as does »[a-z].
then » of a letter c can be grepped for by taking ξ(σ(c)), which results
in a character class.
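
a minimal sketch of that expansion, again in python and again with made-up
names (members, fold_class, expand).  it assumes only plain characters and
simple ranges inside »[...], and it adds the base codepoint itself to the
class so that »n matches n as well as ñ:

```python
import re
import sys
import unicodedata
from collections import defaultdict


def base(cp):
    """sigma: base codepoint via canonical decomposition (field 6)."""
    d = unicodedata.decomposition(chr(cp)).split()
    if not d or d[0].startswith("<"):
        return cp
    return base(int(d[0], 16))


# xi: base codepoint -> set of composed codepoints
XI = defaultdict(set)
for _cp in range(sys.maxunicode + 1):
    _b = base(_cp)
    if _b != _cp:
        XI[_b].add(_cp)


def members(body):
    """Codepoints named by a class body such as "a-z" or "n"."""
    cps = set()
    j = 0
    while j < len(body):
        if j + 2 < len(body) and body[j + 1] == "-":
            cps.update(range(ord(body[j]), ord(body[j + 2]) + 1))
            j += 3
        else:
            cps.add(ord(body[j]))
            j += 1
    return cps


def fold_class(body):
    """xi(sigma(c)) for every c in body, plus the bases themselves."""
    out = set()
    for cp in members(body):
        b = base(cp)
        out.add(b)
        out |= XI.get(b, set())
    return "[" + "".join(re.escape(chr(c)) for c in sorted(out)) + "]"


def expand(pattern):
    """Rewrite »c and »[...] into ordinary regex character classes."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern[i] != "»" or i + 1 == len(pattern):
            out.append(pattern[i])
            i += 1
            continue
        i += 1
        if pattern[i] == "[":
            end = pattern.index("]", i)  # raises ValueError if unclosed
            body = pattern[i + 1:end]
            i = end + 1
        else:
            body = pattern[i]
            i += 1
        out.append(fold_class(body))
    return "".join(out)
```

with this, re.search(expand("ni»no"), text) finds both nino and niño.  text
stored in decomposed form (n followed by combining U+0303) would need
normalizing first, since the class only lists precomposed codepoints.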

plan 9 already has some of this in the c library with tolowerrune, etc.
i did some work with this some time ago and wrote some rc scripts to
generate the to*rune tables from the unicode standard data.  it would
be easy to adapt them to generate ξ and σ.  (the tables would be pretty big.)

>
> What an ugly problem.

it can be made ugly quickly.  but i'm not convinced that all approaches
to this problem are bad.

- erik