From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <6e35c0620803031548v583c051flb4d9ace76e220998@mail.gmail.com>
Date: Mon,  3 Mar 2008 14:48:01 -0900
From: "Jack Johnson" <knapjack@gmail.com>
To: "Fans of the OS Plan 9 from Bell Labs" <9fans@cse.psu.edu>
Subject: Re: [9fans] awk, not utf aware...
In-Reply-To: <6e3f8d391c4219c9b0acb323d74603d6@quanstro.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Content-Disposition: inline
References: <6e3f8d391c4219c9b0acb323d74603d6@quanstro.net>
Topicbox-Message-UUID: 6de463fa-ead3-11e9-9d60-3106f5b1d025

On Thu, Feb 28, 2008 at 6:10 AM, erik quanstrom <quanstro@quanstro.net> wrote:
>  perhaps it would be more effective to break down the concept
>  a bit.  instead of a general locale hammer, why not expose some
>  operations that could go into a locale?  for example, have a base-
>  character folding switch that allows regexps to fold codepoints into
>  base codepoints so that íïìîi -> i.  this information is in the unicode
>  tables.  perhaps the language-dependent character mapping should
>  be specified explicitly. &c.

Loosely-related tangent:

http://www.mail-archive.com/rsync@lists.samba.org/msg20395.html

> On the LINUX machines running utf-8 the ä is coded as $C3A4 which is
> in utf-8 equal to the character E4. The ä occupies in that way 2 bytes.
>
> I was very astonished, when I copied a mac-filename, pasted into a
> texteditor and looked at the file:
>
> In the mac-filename the letter ä is coded as: $61CC88, which in utf-8
> means the letter "a" followed by a $0308. (Combining diacritical marks)
> So the Mac combines the letter a with the two points above it instead
> using the E4 letter
> Now the things are clear: The filenames are different, in spite of
> looking equally.
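
The byte values quoted above are easy to check. A minimal sketch in
Python (my choice of language, nothing from the thread), using only the
standard unicodedata module; the variable names are made up for
illustration:

```python
import unicodedata

precomposed = "\u00e4"   # a-umlaut as one code point, U+00E4
decomposed = "a\u0308"   # "a" followed by U+0308 COMBINING DIAERESIS

# UTF-8 bytes of each form, matching the $C3A4 and $61CC88 quoted above.
linux_bytes = precomposed.encode("utf-8")
mac_bytes = decomposed.encode("utf-8")
print(linux_bytes.hex())  # c3a4
print(mac_bytes.hex())    # 61cc88

# The raw strings differ, but normalizing both to one form makes them equal.
same_raw = precomposed == decomposed
same_nfc = unicodedata.normalize("NFC", decomposed) == precomposed
same_nfd = unicodedata.normalize("NFD", precomposed) == decomposed
print(same_raw, same_nfc, same_nfd)  # False True True
```

So the two filenames compare equal only after both sides are normalized
to the same form (NFC or NFD), whichever one is picked.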

So, if folding codepoints is a reasonable tactic, how many
representations do you need to fold?  How many binary representations
are needed to fold íïìîi -> i?
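
One possible way to do the fold, as a rough sketch rather than a
proposal for awk or the regexp library: decompose first, then drop the
combining marks, so the precomposed and the base-plus-mark spellings
both land on the same base letter. fold_to_base is a hypothetical
helper name; it assumes Python's unicodedata module and only handles
characters that have a canonical decomposition.

```python
import unicodedata

def fold_to_base(text):
    """Fold accented letters to base letters (hypothetical helper)."""
    # NFD splits precomposed characters into base letter + combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # combining() returns a nonzero class for combining marks; drop those.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

accented = "\u00ed\u00ef\u00ec\u00eei"  # the five characters from the example
folded = fold_to_base(accented)
print(folded)  # iiiii

# Both spellings of a-umlaut fold to the same thing.
print(fold_to_base("\u00e4"), fold_to_base("a\u0308"))  # a a
```

For these particular characters that is two spellings each, the single
code point and the base letter plus one combining mark, and the
decomposition step reduces them to one before the fold happens.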

-Jack