9fans - fans of the OS Plan 9 from Bell Labs
From: erik quanstrom <quanstro@quanstro.net>
To: paurea@gmail.com, 9fans@cse.psu.edu
Subject: Re: [9fans] awk, not utf aware...
Date: Wed, 27 Feb 2008 04:57:10 -0500
Message-ID: <6ba21b46fa535aa3c7f5cd555d86f8be@quanstro.net>

> There is split and other functions,
> for example:
> 
> toupper("aí")
> gives
> Aí
> 
> My guess is that there are many more little (or not) corners where it
> doesn't work.
> We can go on and on looking for crevices and hiding the bugs further
> under the rug
> so that they are not evident and find everyone completely unaware,
> leave awk as it is now or really fix the problem. The first approach
> doesn't work. I am going to take
> the second till I have time to take the third which means use runes or
> at least revise all the
> code so that it is uniformly aware of the existence of non-ascii characters.

i don't understand this approach.  you propose redoing a fundamental
part of awk.   yet at the end you won't have solved the bug that's bothering
you.

ignoring the fact that awk is an ape program and doesn't use runes, the
problem with toupper is independent of the internal representation
of strings. as far as i can tell, ape doesn't even have towupper and towlower.

so if you provide those functions, fixing toupper and tolower could be
a 5 minute fix.  and you know you won't have broken anything else.
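
a rough sketch of what "providing those functions" could look like, in
plain c.  every name below (sketch_towupper, sketch_decode, sketch_encode,
sketch_utf8_toupper) is made up for illustration and is not taken from
ape or from awk's source.  the case table covers only ascii and latin-1,
which is an assumption made to keep the example short; a real towupper
would need the full unicode tables.  the point is only that the string
stays utf-8 bytes throughout and the case mapping is applied one decoded
character at a time.

```c
#include <stddef.h>
#include <string.h>

/* hypothetical stand-in for towupper(): ascii and latin-1 only */
static unsigned long
sketch_towupper(unsigned long c)
{
	if (c >= 'a' && c <= 'z')
		return c - 0x20;
	if (c >= 0xE0 && c <= 0xFE && c != 0xF7)	/* a-grave..thorn, skipping the division sign */
		return c - 0x20;
	if (c == 0xFF)	/* y-diaeresis maps outside latin-1, to U+0178 */
		return 0x178;
	return c;
}

/*
 * decode one utf-8 sequence at s into *c.
 * returns the number of bytes consumed, or 0 if the bytes are malformed
 * (bad lead byte, missing continuation byte, or overlong form).
 * surrogates and values above U+10FFFF are not rejected in this sketch.
 */
static size_t
sketch_decode(const unsigned char *s, unsigned long *c)
{
	static const unsigned long min[5] = { 0, 0, 0x80, 0x800, 0x10000 };
	size_t i, n;

	if (s[0] < 0x80) {
		*c = s[0];
		return 1;
	} else if ((s[0] & 0xE0) == 0xC0) {
		n = 2; *c = s[0] & 0x1F;
	} else if ((s[0] & 0xF0) == 0xE0) {
		n = 3; *c = s[0] & 0x0F;
	} else if ((s[0] & 0xF8) == 0xF0) {
		n = 4; *c = s[0] & 0x07;
	} else
		return 0;
	for (i = 1; i < n; i++) {
		if ((s[i] & 0xC0) != 0x80)	/* also stops at the terminating NUL */
			return 0;
		*c = (*c << 6) | (s[i] & 0x3F);
	}
	if (*c < min[n])
		return 0;
	return n;
}

/* encode c as utf-8 into out (room for 4 bytes); returns bytes written */
static size_t
sketch_encode(unsigned long c, unsigned char *out)
{
	if (c < 0x80) {
		out[0] = (unsigned char)c;
		return 1;
	}
	if (c < 0x800) {
		out[0] = (unsigned char)(0xC0 | (c >> 6));
		out[1] = (unsigned char)(0x80 | (c & 0x3F));
		return 2;
	}
	if (c < 0x10000) {
		out[0] = (unsigned char)(0xE0 | (c >> 12));
		out[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
		out[2] = (unsigned char)(0x80 | (c & 0x3F));
		return 3;
	}
	out[0] = (unsigned char)(0xF0 | (c >> 18));
	out[1] = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
	out[2] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
	out[3] = (unsigned char)(0x80 | (c & 0x3F));
	return 4;
}

/*
 * copy src to dst, upper-casing each decoded character.
 * malformed bytes are copied through untouched.  output is truncated
 * at a character boundary if dst is too small.  returns bytes written,
 * not counting the terminating NUL.
 */
static size_t
sketch_utf8_toupper(const char *src, char *dst, size_t dstlen)
{
	const unsigned char *s = (const unsigned char *)src;
	unsigned char buf[4];
	unsigned long c;
	size_t n, m, o = 0;

	if (dstlen == 0)
		return 0;
	while (*s != '\0') {
		n = sketch_decode(s, &c);
		if (n == 0) {
			buf[0] = *s;
			m = n = 1;
		} else
			m = sketch_encode(sketch_towupper(c), buf);
		if (o + m >= dstlen)
			break;
		memcpy(dst + o, buf, m);
		o += m;
		s += n;
	}
	dst[o] = '\0';
	return o;
}
```

with this, the bytes for "aí" (61 c3 ad) come out as the bytes for "AÍ"
(41 c3 8d), and nothing about how the rest of the program stores its
strings has to change.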

/sys/doc/utf.ps is worth a read.  it's not too hard to think of situations
that depend on character boundaries or operate on non-ascii characters.
generally there are few.  for example, rc only bothers with character
boundaries in matching. perhaps you could build a utf testsuite for awk.
make sure to use non-latin1 languages, too.
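
as one example of a place where character boundaries matter, a test
might compare a byte count against a character count.  the helper below
is a hypothetical sketch (sketch_utflen is an invented name, not an awk
or ape function) and assumes its input is well-formed utf-8.  it counts
characters by skipping continuation bytes, which all have the form
10xxxxxx.

```c
#include <stddef.h>

/*
 * count characters in a NUL-terminated utf-8 string by counting
 * every byte that is not a continuation byte (10xxxxxx).
 * assumes well-formed input; does no validation.
 */
static size_t
sketch_utflen(const char *s)
{
	size_t n = 0;

	for (; *s != '\0'; s++)
		if (((unsigned char)*s & 0xC0) != 0x80)
			n++;
	return n;
}
```

a non-latin1 case such as "日本語" is 9 bytes but 3 characters, so a
test built this way would catch any code that reports bytes where
characters are expected.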

- erik
