9fans - fans of the OS Plan 9 from Bell Labs
From: erik quanstrom <quanstro@quanstro.net>
To: 9fans@cse.psu.edu
Subject: Re: [9fans] awk, not utf aware...
Date: Mon,  3 Mar 2008 19:13:48 -0500
Message-ID: <5a7299b417393fc99fd462617a8a51c2@quanstro.net>
In-Reply-To: <6e35c0620803031548v583c051flb4d9ace76e220998@mail.gmail.com>

> > On the LINUX machines running utf-8 the ä is coded as $C3A4 which is
> > in utf-8 equal to the character E4. The ä occupies in that way 2 bytes.
> >
> > I was very astonished, when I copied a mac-filename, pasted into a
> > texteditor and looked at the file:
> >
> > In the mac-filename the letter ä is coded as: $61CC88, which in utf-8
> > means the letter "a" followed by a $0308. (Combining diacritical marks)
> > So the Mac combines the letter a with the two points above it instead
> > of using the E4 letter.
> > Now things are clear: the filenames are different, despite
> > looking identical.
> 
> So, if folding codepoints is a reasonable tactic, how many
> representations do you need to fold?  How many binary representations
> are needed to fold íïìîi -> i?

i didn't make my point very well.  in this case i was suggesting a -f flag
for grep that would map codepoints to their base codepoint.  the match
result would be the original text --- in the manner of the -i flag.
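
a minimal sketch of what such folding might look like, in python rather
than as a grep flag.  fold and grep_fold are made-up names, and the
approach (decompose, then drop marks with a nonzero combining class) is
only one assumption about how "base codepoint" could be defined.

```python
import unicodedata

def fold(s):
    """Decompose s (NFD) and drop combining marks, leaving base codepoints.

    Hypothetical stand-in for the proposed -f behaviour; it only drops
    marks whose canonical combining class is nonzero.
    """
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def grep_fold(pattern, lines):
    """Return the original, unfolded lines whose folded form contains the folded pattern."""
    p = fold(pattern)
    return [line for line in lines if p in fold(line)]

# precomposed U+00E4, decomposed U+0061 U+0308, and a non-match
lines = ["b\u00e4r", "ba\u0308r", "bor"]
matches = grep_fold("bar", lines)
```

both spellings of the ä line match "bar", and what comes back is the
text as it was written, not the folded form.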

separately, however ...

unicode combining characters are a really unfortunate choice, imho.  there
is no limit to the number of combining codepoints one can add to
a base codepoint.  you can, for example, build a single letter like this
	U+0061 U+0302 ... U+0302
i don't think it's possible to build legible glyphs from bitmaps using
combining diacriticals.
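
to make the unbounded stacking concrete, here is a small python sketch.
stack is a made-up helper and the count of 5 is arbitrary; any count is
accepted the same way.

```python
import unicodedata

def stack(base, mark, n):
    """Build one 'letter': a base codepoint followed by n copies of a combining mark."""
    return base + mark * n

letter = stack("\u0061", "\u0302", 5)   # U+0061 U+0302 ... U+0302

# every codepoint after the base has a nonzero combining class, so the
# whole sequence attaches to the single base letter
marks = [c for c in letter[1:] if unicodedata.combining(c)]
```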

therefore, i would argue for reducing letters made up of base+combiners
to a precombined codepoint whenever possible.  it would be helpful
if tcs did this.  unfortunately some transliterations of russian into the roman
alphabet use characters with no precombined form in unicode.
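
as a rough illustration (python's unicodedata, not tcs), unicode's NFC
normalization does this kind of reduction.  precombine is a made-up
name, and q + U+0302 is just a convenient sequence with no precombined
form, not one taken from any particular transliteration scheme.

```python
import unicodedata

def precombine(s):
    """Reduce base+combiner sequences to precombined codepoints where one exists (NFC)."""
    return unicodedata.normalize("NFC", s)

# the mac-style filename spelling: a + combining diaeresis, bytes 61 CC 88
mac_bytes = "a\u0308".encode("utf-8")

# after precombining it is the single codepoint U+00E4, bytes C3 A4
a_umlaut = precombine("a\u0308")
linux_bytes = a_umlaut.encode("utf-8")

# no precombined q-circumflex exists, so this stays two codepoints
q_circ = precombine("q\u0302")
```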

rob probably has a more informed opinion on this than i.

- erik

