From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <6e3f8d391c4219c9b0acb323d74603d6@quanstro.net>
To: 9fans@cse.psu.edu
Subject: Re: [9fans] awk, not utf aware...
From: erik quanstrom <quanstro@quanstro.net>
Date: Thu, 28 Feb 2008 10:10:29 -0500
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Topicbox-Message-UUID: 673c491e-ead3-11e9-9d60-3106f5b1d025

i had to dig this off 9fans.net/archive.  htmlfmt does some very bad things
with non-ascii characters.  i hope i put them back correctly.

> Yes, and then there is locale: does [a-z] include ĳ when you run it
> in Holland (it should)?  Does it include á, è, ô in France (it should)?
> Does it include ø, å in Norway (it should not)?  And what happens when
> you evaluate "è" < "o" (it depends)?
>
> Fixing awk is much harder than anyone thinks.  I had a chat about it with
> Brian Kernighan and he says he's been thinking about fixing awk for a
> long time, but that it really is a hard problem.

how does a program know where it's being run?  ☺ how do you write a
program that processes byte streams from a dutch user and from a
norwegian?  how does one deal with a multi-language file?

i see some problems with localized regexps.  like pre-utf character
sets, it's impossible to tell from a byte stream what the character
set is.  two users can run the same program and get different results.
(how do you test in an environment like this?) and, of course, you
can't switch locale within a file, which makes multi-language files
difficult.
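
a small sketch of the byte-stream problem, in python (my choice of
language for illustration, not anything from the thread): the same byte
decodes to different characters depending on which pre-utf character set
you assume, and nothing in the byte says which one is right.  the two
codecs are arbitrary picks.

```python
# one byte, as it might arrive on a pipe with no charset label
raw = b"\xe8"

# two equally plausible readings of that byte
as_latin1 = raw.decode("latin-1")    # western european reading: è
as_greek = raw.decode("iso8859_7")   # greek reading: θ

# same input, two different answers
print(repr(as_latin1), repr(as_greek))
```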

perhaps it would be more effective to break down the concept
a bit.  instead of a general locale hammer, why not expose some
operations that could go into a locale?  for example, have a
base-character folding switch that allows regexps to fold codepoints into
base codepoints so that íïìîi -> i.  this information is in the unicode
tables.  perhaps the language-dependent character mapping should
be specified explicitly. &c.
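
a minimal sketch of what such a folding switch might do, in python,
assuming "base-character folding" means canonical decomposition (NFD)
followed by dropping the combining marks.  fold_base is a hypothetical
name, and this is one possible reading of the idea, not how awk or any
plan 9 tool does it.

```python
import re
import unicodedata

def fold_base(s):
    """Fold each character to its base codepoint (hypothetical helper)."""
    # NFD splits e.g. í into i followed by a combining acute accent
    decomposed = unicodedata.normalize("NFD", s)
    # drop the combining marks, keeping only the base characters
    return "".join(c for c in decomposed if not unicodedata.combining(c))

# fold the input first, then apply an ordinary ascii regexp
word = fold_base("naïve")
print(word, bool(re.fullmatch("[a-z]+", word)))
```

under this reading ø is left alone, since the unicode tables give it no
canonical decomposition, while å folds to a.  whether that is wanted is
exactly the kind of language-dependent choice that would have to be
specified explicitly.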

- erik