From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <c8dc77ef829fda94952fe2504fc04516@quanstro.net>
From: erik quanstrom <quanstro@quanstro.net>
Date: Sun, 26 Jul 2009 10:32:58 -0400
To: 9fans@9fans.net
In-Reply-To: <14ec7b180907260041h18f63c64x871a7059cc9244bb@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 8bit
Subject: Re: [9fans] Woes of New Language Support
Topicbox-Message-UUID: 2e0b5c0a-ead5-11e9-9d60-3106f5b1d025

> the real problem isn't in viewing them however, but comes when you
> start searching for them: it's easy to search for ë (e-umlaut) for
> example, but what if it's described as e+"U+0308 COMBINING DIAERESIS"?
> the answer is the UTS#18 Regular Expressions technical standard which
> probably contributes at least half of the slowness of gnu grep
> discussed in another thread. http://www.unicode.org/reports/tr18/

iirc, gnu grep calls malloc for each character of utf-8 input.  awesome.

at a minimum, it would be good to add support to tcs
to translate utf to canonical form.  this would make the
searching problem much easier.
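
a rough sketch of why normalizing first helps, in python rather than as
a tcs change.  it assumes NFC as the canonical form (NFD would work the
same way for matching), and the names canon and contains are made up
for illustration.  once both the pattern and the input are in one form,
a plain substring search finds ë however it was spelled.

```python
import unicodedata


def canon(s: str) -> str:
    # assumption: NFC is the canonical form we settle on.
    # NFC = canonical decomposition followed by canonical composition.
    return unicodedata.normalize("NFC", s)


def contains(needle: str, haystack: str) -> bool:
    # hypothetical helper: normalize both sides, then do an
    # ordinary substring search on code points.
    return canon(needle) in canon(haystack)


precomposed = "\u00eb"   # ë as a single code point
combining = "e\u0308"    # e followed by U+0308 COMBINING DIAERESIS

# the two spellings differ as raw code point sequences ...
assert precomposed != combining
# ... but are identical once both are in canonical form.
assert canon(precomposed) == canon(combining)
```

with this, contains("\u00eb", "noe\u0308l") is true even though the
raw strings share no ë code point.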

the unicode proposal says that matches depend on (re, locale, input),
not just (re, input).  i would think that is not acceptable.

- erik