9fans - fans of the OS Plan 9 from Bell Labs
From: Tristan Plumb <9p-st@imu.li>
To: 9fans@cse.psu.edu
Subject: [9fans] localization, unicode, regexps (was: awk, not utf aware...)
Date: Thu, 28 Feb 2008 14:06:57 -0500
Message-ID: <2573fd.277c0a14.xzi-bfuOlmWG.3397.mx@tulgey.imu.li>
In-Reply-To: <39d22acfc53470335fdb74156c738feb@plan9.bell-labs.com> <6e3f8d391c4219c9b0acb323d74603d6@quanstro.net> <5d375e920802271201n263476cbnf8929a92c1ba3177@mail.gmail.com>

> erik | Sape * uriel

I have been pondering character sets rather a lot recently (mostly wishful
thinking, by my estimation), so this conversation set me thinking more...

> how does one deal with a multi-language file.
By not dealing in languages? Unicode (however flawed) solves multi-script
files, why mire ourselves in mutable (scripts are plenty) language rules?

> for example, have a base-character folding switch that allows regexps
> to fold codepoints into base codepoints so that íïìîi -> i.
I would favor decomposing codepoints (í→í, ï→ï, ì→ì, î→î) with the switch
to ignore combining characters. That has the disadvantage of lengthening
your text by a byte or a rune at a time, but it does allow you to match accents.
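
A minimal sketch of that switch, in Python rather than anything Plan 9 ships.
The helper name fold_accents is made up for illustration, and I assume NFD
is the decomposition meant here. It also measures the cost of decomposing
one accented letter, in runes and in UTF-8 bytes.

```python
import unicodedata

def fold_accents(text: str) -> str:
    """Decompose to NFD, then drop combining marks (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns a nonzero class for combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Cost of decomposition for one letter: i with acute.
precomposed = "\u00ed"                                   # one codepoint
decomposed = unicodedata.normalize("NFD", precomposed)   # 'i' + U+0301
extra_runes = len(decomposed) - len(precomposed)
extra_bytes = len(decomposed.encode("utf-8")) - len(precomposed.encode("utf-8"))
```

With the switch off you keep the decomposed text and can still match the
accent itself; with it on, fold_accents gives you the base letters only.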

| Yes, and then there is locale: does [a-z] include ij when you run it
| in Holland (it should)?  Does it include á, è, ô in France (it should)?
| Does it include ø, å in Norway (it should not)?  And what happens when
| you evaluate "è"< "o" (it depends)?
Does Spanish [a-c] match the c in ch (depends on when and where you ask)?
More Unicode-centric, does 'a' match (the first byte of) 'à' (U+0061 U+0300)
(or all three bytes, or not at all)?
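
To make the question concrete, here is a small sketch using Python's re
module (my choice of tool, not one from this thread). It shows what one
particular engine answers: matching on bytes and matching on codepoints both
take only the leading 'a' of the decomposed form, while the precomposed form
does not match at all.

```python
import re

decomposed = "a\u0300"    # 'a' followed by COMBINING GRAVE ACCENT
precomposed = "\u00e0"    # the same letter as a single codepoint

raw = decomposed.encode("utf-8")          # bytes 0x61 0xcc 0x80

byte_match = re.match(b"a", raw)          # byte-oriented matching
rune_match = re.match("a", decomposed)    # codepoint-oriented matching
precomposed_match = re.match("a", precomposed)
```

Other engines may answer differently; that is rather the point.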

I would write [a-z] in a regexp upon two occasions: for a letter of the Latin
alphabet (better served by something like [[:latin:]], so I needn't add a
bunch of other things ([þðæœø])), or for the bytes [0x61, 0x7a]. As any sort of
public project is stuck with Unicode (not advocating the hysteria before,
just wishing Unicode left some of it behind), regexps reflecting Unicode,
not the user's language, make sense to me. Unicode is at least codified.
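
A rough sketch of the difference between the two readings. Python's re has
no [[:latin:]] class, so is_latin_letter below is a hypothetical stand-in
that leans on Unicode character names; it is an approximation for
illustration, not a proper script lookup.

```python
import re
import unicodedata

def is_latin_letter(ch: str) -> bool:
    """Approximate a Latin-script class via the character name (assumption)."""
    try:
        name = unicodedata.name(ch)
    except ValueError:          # unnamed codepoint
        return False
    return unicodedata.category(ch).startswith("L") and name.startswith("LATIN ")

extras = "\u00fe\u00f0\u00e6\u0153\u00f8"   # thorn, eth, ae, oe, o with stroke

# [a-z] as the range 0x61..0x7a misses every one of them.
ascii_hits = [ch for ch in extras if re.fullmatch("[a-z]", ch)]
# The script-style test accepts them all.
latin_hits = [ch for ch in extras if is_latin_letter(ch)]
```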

* I think the plan9 tools demonstrate that it is not so hard to find a
* 'good enough' solution; and the lunix locale debacle demonstrates that
* if you want to get it 'right' you will end up with a nightmare.
Yet something that is good enough for one idea (I'll pick on Unicode),
where lumping character sets together serves well to write multiple scripts
in the same file, spawns nightmares, ǭ = ǭ = ǭ = ǭ = ǭ, good enough being
ill-thought-out. Yet mayhap you mean well-compromised (that seems right).
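
For the curious, a sketch of how one letter ends up with five spellings.
The five codepoint sequences below are my assumption about what such a list
could contain (the archive does not show which ones were typed); all are
canonically equivalent, so NFC normalization maps each to U+01ED.

```python
import unicodedata

# Five assumed spellings of o with ogonek and macron.
forms = [
    "\u01ed",          # precomposed
    "\u01eb\u0304",    # o-ogonek + combining macron
    "\u014d\u0328",    # o-macron + combining ogonek
    "o\u0328\u0304",   # o + ogonek + macron
    "o\u0304\u0328",   # o + macron + ogonek (reordered by normalization)
]

distinct_raw = len(set(forms))                                     # as typed
normalized = {unicodedata.normalize("NFC", f) for f in forms}      # after NFC
```

Compared codepoint by codepoint they are five different strings; compared
after normalization they are one.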

To those who were at IWP9 this year: Cast your mind back to a question of
plan9 people with a vested interest in RtL rendering and the like. I should
have stood up then and cried out, I! Imagine either I did so or I do now.

If anyone has interest in playing on this at a character set level, tell?

enjoy,
tristan

-- 
All original matter is hereby placed immediately under the public domain.

