From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID:
From: erik quanstrom
Date: Sun, 26 Jul 2009 10:32:58 -0400
To: 9fans@9fans.net
In-Reply-To: <14ec7b180907260041h18f63c64x871a7059cc9244bb@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 8bit
Subject: Re: [9fans] Woes of New Language Support
Topicbox-Message-UUID: 2e0b5c0a-ead5-11e9-9d60-3106f5b1d025

> the real problem isn't in viewing them however, but comes when you
> start searching for them: it's easy to search for ë (e-umlaut) for
> example, but what if it's described as e+"U+0308 COMBINING DIAERESIS"?
> the answer is the UTS#18 Regular Expressions technical standard which
> probably contributes at least half of the slowness of gnu grep
> discussed in another thread. http://www.unicode.org/reports/tr18/

iirc, gnu grep calls malloc for each character of utf-8 input.
awesome.

at a minimum, it would be good to add support to tcs to
translate to canonical form utf. this would make the
searching problem much easier.

the unicode proposal says that matches depend on (re, locale, input),
not just (re, input). i would think that is not acceptable.

- erik
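
a minimal sketch of why translating to a canonical form helps with searching. this is python's standard unicodedata module, not tcs, and the helper names canonical and contains are made up for illustration. both the pattern and the input are normalized to NFC before a plain substring search, so the precomposed and the combining spellings of ë compare equal without any locale involved.

```python
import unicodedata

def canonical(text: str, form: str = "NFC") -> str:
    """Return text in a Unicode normalization form (NFC by default)."""
    return unicodedata.normalize(form, text)

def contains(haystack: str, needle: str) -> bool:
    """Plain substring search, with both sides normalized first (hypothetical helper)."""
    return canonical(needle) in canonical(haystack)

precomposed = "\u00eb"   # ë as a single code point
decomposed = "e\u0308"   # e followed by U+0308 COMBINING DIAERESIS

# "noël" spelled with the combining mark
text = "no" + decomposed + "l"

# a raw search misses because the code point sequences differ
raw_hit = precomposed in text

# after normalization both spellings are the same sequence
normalized_hit = contains(text, precomposed)

print(raw_hit, normalized_hit)
```

the result depends only on (pattern, input). note that the choice of form still matters for partial matches: under NFC a search for a bare "e" will not match inside ë, while under NFD it will.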