From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <5a7299b417393fc99fd462617a8a51c2@quanstro.net>
From: erik quanstrom <quanstro@quanstro.net>
Date: Mon,  3 Mar 2008 19:13:48 -0500
To: 9fans@cse.psu.edu
Subject: Re: [9fans] awk, not utf aware...
In-Reply-To: <6e35c0620803031548v583c051flb4d9ace76e220998@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Topicbox-Message-UUID: 6deab246-ead3-11e9-9d60-3106f5b1d025

> > On the LINUX machines running utf-8 the ä is coded as $C3A4 which is
> > in utf-8 equal to the character E4. The ä occupies in that way 2 bytes.
> >
> > I was very astonished, when I copied a mac-filename, pasted into a
> > texteditor and looked at the file:
> >
> > In the mac-filename the letter ä is coded as: $61CC88, which in utf-8
> > means the letter "a" followed by a $0308. (Combining diacritical marks)
> > So the Mac combines the letter a with the two points above it instead
> > using the E4 letter
> > Now the things are clear: The filenames are different, in spite of
> > looking equally.
>
> So, if folding codepoints is a reasonable tactic, how many
> representations do you need to fold?  How many binary representations
> are needed to fold íïìîi -> i?

i didn't make my point very well.  in this case i was suggesting a -f flag
for grep that would map codepoints to their base codepoints.  the match
result would be the original text, in the manner of the -i flag.
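
a minimal sketch of that folding idea, in python rather than as a real grep
flag.  the names fold and grep_fold are hypothetical, and the sketch assumes
"base codepoint" means the canonical decomposition (NFD) with the combining
marks dropped.  letters with no decomposition, such as ø, pass through
unchanged.

```python
import unicodedata

def fold(text):
    """Drop combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    # combining() returns a nonzero class for combining marks
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def grep_fold(pattern, lines):
    """Yield the original lines whose folded form contains the folded pattern."""
    needle = fold(pattern)
    for line in lines:
        if needle in fold(line):
            yield line  # the match result is the unfolded text, as with -i
```

with this, both the precombined U+00E4 and the sequence U+0061 U+0308 fold
to a plain a, so the two filenames from the quoted message would match each
other.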

separately, however ...

unicode combining characters are a really unfortunate choice, imho.  there
is no limit to the number of combining codepoints one can add to
a base codepoint.  you can, for example, build a single letter like this
	U+0061 U+0302 ... U+0302
i don't think it's possible to build legible glyphs from bitmaps using
combining diacriticals.

therefore, i would argue for reducing letters made up of base+combiners
to a precombined codepoint whenever possible.  it would be helpful
if tcs did this.  unfortunately, some transliterations of russian into the roman
alphabet use characters with no precombined form in unicode.
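
as a rough illustration of that reduction (not what tcs does), unicode's
canonical composition (NFC) does this where a precombined codepoint exists.
the function name precompose is hypothetical, and q + U+0308 is only an
example of a pair with no precombined form, not one taken from any
particular transliteration scheme.

```python
import unicodedata

def precompose(text):
    """Replace base+combiner sequences with precombined codepoints where they exist."""
    return unicodedata.normalize("NFC", text)

# a + combining diaeresis collapses to the single codepoint U+00E4
composed = precompose("a\u0308")

# q + combining diaeresis has no precombined form, so it stays two codepoints
uncomposed = precompose("q\u0308")

# a + two circumflexes: only the first combiner can be absorbed (U+00E2),
# the second remains as a separate combining codepoint
stacked = precompose("a\u0302\u0302")
```

so the reduction can only ever be partial: anything without a precombined
codepoint, and any stack of repeated combiners, still comes out as
base+combiners.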

rob probably has a more informed opinion on this than i.

- erik