From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <e53cd4b0faac1f3090800c8190d4958c@quanstro.net>
From: erik quanstrom <quanstro@quanstro.net>
Date: Wed, 10 Oct 2007 00:02:38 -0400
To: 9fans@cse.psu.edu
Subject: Re: [9fans] simplicity
In-Reply-To: <6e35c0620710092030u1187029dhf54f67e48a62b85c@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Topicbox-Message-UUID: cd87f2c8-ead2-11e9-9d60-3106f5b1d025

> Yes, old thread, sorry.  Blame Uriel.
>
> On 9/18/07, Douglas A. Gwyn <DAGwyn@null.net> wrote:
> > erik quanstrom wrote:
> > > suppose Linux user a and user b grep the same "text" file for the same string.
> > > results will depend on the users' locales.
> >
> > But if they're trying to match an alphabetic character class, the
> > result *should* depend on the locale.
>
> This baffles me.  Can anyone think of examples where one might want
> differing results depending on your locale?
>
> -Jack

i think i see what the reasoning is.  the thought is that, e.g.,
in spanish [a-z] should match ñ.

the problem is this means that grep(regexp, data) now
returns a set of results, one for each locale.

so on the one hand, one would like [a-z] to do the Right Thing,
depending on language.  and on the other hand, one wants
grep(regexp, data) to return a single result.

i think the way to see through this issue is to notice that
the reason we want ñ to be in [a-z] is because of visual
similarity.  what if we were dealing with chinese?  i think
it's pretty clear that [a-z] should map to a contiguous set
of unicode codepoints.
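
a rough sketch of what "a contiguous set of unicode codepoints" buys you,
written in python only because its re module compares str patterns by
codepoint and does not consult the locale.  the grep() helper and the
sample data are made up for illustration; this is not how any particular
grep is implemented.

```python
import re

def grep(pattern, lines):
    """Return the lines matching pattern; one answer, whatever the locale."""
    rx = re.compile(pattern)
    return [line for line in lines if rx.search(line)]

def in_a_to_z(ch):
    """[a-z] read as the contiguous codepoint range U+0061..U+007A."""
    return 0x61 <= ord(ch) <= 0x7A

# hypothetical sample data
data = ["nino", "ni\u00f1o", "\u4e2d\u6587"]   # "nino", "niño", two chinese characters

matches = grep(r"^[a-z]+$", data)
print(matches)
```

under this reading ñ (U+00F1) is simply outside the range, the same way
a chinese character is, and every user gets the same single result.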

if you want to deal with ñ, the unicode tables do note that ñ
is n+combining ~, so one could come up with a new
denotation for base codepoint.  unfortunately, combining
that with existing regexp would be a bit painful.
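
a minimal sketch of the base-codepoint idea, assuming one is willing to
decompose the text before matching rather than invent new regexp
notation.  it uses python's unicodedata module (canonical decomposition,
NFD, then dropping combining marks).  base_codepoints() and grep_base()
are hypothetical names, not part of any existing tool.

```python
import re
import unicodedata

def base_codepoints(text):
    """Decompose text (NFD) and drop combining marks, keeping base codepoints."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.combining(c) == 0)

def grep_base(pattern, lines):
    """Return lines whose base-codepoint form matches pattern.

    The pattern is still matched by codepoint, so the result does not
    depend on anyone's locale.
    """
    rx = re.compile(pattern)
    return [line for line in lines if rx.search(base_codepoints(line))]

# U+00F1 decomposes to 'n' followed by U+0303 (combining tilde)
parts = [hex(ord(c)) for c in unicodedata.normalize("NFD", "\u00f1")]
print(parts)
print(grep_base(r"^[a-z]+$", ["a\u00f1o", "ni\u00f1o", "\u4e2d\u6587"]))
```

the painful part remains: this throws the tilde away for the whole
pattern, so there is no way to ask for ñ specifically in one place and
"any n" in another without extra notation.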

- erik