From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <5a7299b417393fc99fd462617a8a51c2@quanstro.net>
From: erik quanstrom
Date: Mon, 3 Mar 2008 19:13:48 -0500
To: 9fans@cse.psu.edu
Subject: Re: [9fans] awk, not utf aware...
In-Reply-To: <6e35c0620803031548v583c051flb4d9ace76e220998@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Topicbox-Message-UUID: 6deab246-ead3-11e9-9d60-3106f5b1d025

> > On the LINUX machines running utf-8 the ä is coded as $C3A4 which is
> > in utf-8 equal to the character E4. The ä occupies in that way 2 bytes.
> >
> > I was very astonished, when I copied a mac-filename, pasted into a
> > texteditor and looked at the file:
> >
> > In the mac-filename the letter ä is coded as: $61CC88, which in utf-8
> > means the letter "a" followed by a $0308. (Combining diacritical marks)
> > So the Mac combines the letter a with the two points above it instead
> > using the E4 letter
> > Now the things are clear: The filenames are different, in spite of
> > looking equally.
>
> So, if folding codepoints is a reasonable tactic, how many
> representations do you need to fold? How many binary representations
> are needed to fold íïìîi -> i?

i didn't make my point very well. in this case i was suggesting a
-f flag for grep that would map codepoints to their base codepoints.
the match result would be the original text, in the manner of the
-i flag.

separately, however ...

unicode combining characters are a really unfortunate choice, imho.
there is no limit to the number of combining codepoints one can add
to a base codepoint. you can, for example, build a single letter
like this

	U+0061 U+0302 ... U+0302

i don't think it's possible to build legible glyphs from bitmaps
using combining diacriticals. therefore, i would argue for reducing
letters made up of base+combiners to a precombined codepoint
whenever possible.
it would be helpful if tcs did this. unfortunately, some
transliterations of russian into the roman alphabet use characters
with no precombined form in unicode.

rob probably has a more informed opinion on this than i.

- erik
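
a minimal sketch of the two ideas above, folding to base codepoints and reducing base+combiners to a precombined codepoint. it is written in python with the standard unicodedata module rather than plan 9 c, and the names fold and precompose are made up for illustration. it is not how grep or tcs work; it only shows the mapping under the assumption that unicode's NFD and NFC normalization forms are acceptable stand-ins for "decompose" and "precombine".

```python
import unicodedata


def fold(s: str) -> str:
    """Map each character to its base codepoint (hypothetical helper).

    Decompose with NFD, then drop every combining mark.
    """
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def precompose(s: str) -> str:
    """Replace base+combiner sequences with a precombined codepoint
    wherever unicode defines one (hypothetical helper, plain NFC)."""
    return unicodedata.normalize("NFC", s)


linux_name = "\u00e4"   # single codepoint, utf-8 bytes C3 A4
mac_name = "a\u0308"    # "a" + combining diaeresis, utf-8 bytes 61 CC 88

print(linux_name == mac_name)               # False: same look, different codepoints
print(precompose(mac_name) == linux_name)   # True once precombined
print(fold("\u00ed\u00ef\u00ec\u00eei"))    # iiiii

# only the first circumflex has a precombined form; the rest stay as combiners
stacked = "a" + "\u0302" * 5
print(len(stacked), len(precompose(stacked)))   # 6 5
```

a -f style match would compare fold(pattern) against fold(line) and print the original line. the last example shows why precombining only helps "whenever possible": sequences with no precombined form come through unchanged.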