From: Oliver Kiddle
To: Zsh-workers
Subject: Re: UTF-8 support
Date: Tue, 05 Oct 2004 13:01:32 +0200
Message-ID: <29214.1096974092@trentino.logica.co.uk>
In-reply-to: <200410041620.i94GKNro006000@news01.csr.com>
References: <20041001184122.GA9094@fargo> <23473.1096659965@trentino.logica.co.uk> <200410041620.i94GKNro006000@news01.csr.com>
X-Seq: 20453

Peter wrote:
> I came to the conclusion that was going to be very time consuming --- it
> means unmetafying potentially a long string (we don't know where the
> characters end) and calling a function every time we want to compare multibyte
> characters. Doing it only for UTF-8 can be optimised to work with
> extensions to the current tests; it's simple to test for the length of a
> UTF-8 character (although some error checking is also necessary).

If you want to find a short string in a long string you can surely
metafy the short string instead of unmetafying the long string. The
approach I was suggesting has the big advantage that we can add support
in isolated areas without first breaking the entire shell.

I think it would be a bad mistake to write our own UTF-8 specific
versions of all the routines that libc already provides, even if we can
make one or two slightly more efficient by handling the meta process at
the same time.

And if we're going to restrict the code to UTF-8, we could ditch the
meta stuff and use overcoding. This amounts to storing the null
character as the overlong two-byte sequence c0 80. The code for that
would be a lot simpler but you can't expect to pass overlong sequences
elsewhere without getting errors. At least UTF-8 allows you to strchr
for 7-bit ASCII characters in a UTF-8 string (other multi-byte encodings
allow this only for /).

Can we perhaps change the Meta character to 0xc0? We could then use
overcoding for UTF-8 and keep the UTF-8 specific code in the metafy
process to a minimum.

The most efficient way would be to maintain string lengths, Pascal style
(length in bytes, not characters), possibly even using wchar_t instead of
multi-byte encodings.
We could perhaps do that for limited sections of code such as parameters.
That would cope better when someone decides to change the current locale.
If we extend that elsewhere, we need to be careful if we want to maintain
portability of word code files, however.

> Given that the whole point of Unicode is to replace all other schemes,
> I'm not so keen about supporting other schemes if it's that much less
> efficient.

I'm not suggesting supporting alternatives to Unicode but alternatives
to UTF-8. I'd bet that single-byte 8-bit encodings will stick around on
small or embedded systems for longer than you might expect.

My main objection is to any suggestion of not using library calls to
handle the work. mblen may be easy to reimplement but wcwidth is not, so
we'd end up with a mixture.

I don't mind so much whether we support other multibyte encodings with
more limited ASCII compatibility than UTF-8. It'd be better to have
limited support than an error message followed by setting LC_CTYPE to C,
though.

Oliver
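
For anyone who wants the overcoding idea and the UTF-8 length test spelled
out, here is a rough sketch in C. It is not zsh code and has nothing to do
with the existing metafy routines; the names utf8_seq_len, overcode and
undo_overcode are made up for illustration, and the caller is assumed to
supply a large enough buffer.

```c
#include <stddef.h>
#include <string.h>

/* Length in bytes of a UTF-8 sequence, judged from its lead byte alone.
 * Returns 0 for bytes that cannot start a well-formed sequence:
 * continuation bytes (0x80-0xbf), 0xc0 and 0xc1 (which only ever start
 * overlong two-byte forms) and 0xf5-0xff.  The continuation bytes still
 * need checking separately; this is only the cheap first step. */
static size_t utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80) return 1;
    if (lead < 0xc2) return 0;
    if (lead < 0xe0) return 2;
    if (lead < 0xf0) return 3;
    if (lead < 0xf5) return 4;
    return 0;
}

/* Hypothetical helper: copy len bytes from src to dst, storing every
 * null byte as the overlong pair c0 80 so the result is an ordinary
 * null-terminated C string.  dst must have room for 2 * len + 1 bytes.
 * Returns the number of bytes written, not counting the terminator. */
static size_t overcode(const char *src, size_t len, char *dst)
{
    size_t out = 0;
    for (size_t i = 0; i < len; i++) {
        if (src[i] == '\0') {
            dst[out++] = (char)0xc0;
            dst[out++] = (char)0x80;
        } else {
            dst[out++] = src[i];
        }
    }
    dst[out] = '\0';
    return out;
}

/* Hypothetical inverse: turn each c0 80 pair back into a null byte,
 * in place.  Returns the decoded length in bytes; the decoded data may
 * contain nulls, so the caller has to carry that length around. */
static size_t undo_overcode(char *s)
{
    size_t in = 0, out = 0;
    while (s[in] != '\0') {
        if ((unsigned char)s[in] == 0xc0 && (unsigned char)s[in + 1] == 0x80) {
            s[out++] = '\0';
            in += 2;
        } else {
            s[out++] = s[in++];
        }
    }
    return out;
}
```

With the four bytes 'a', null, '/', 'b' as input, overcode produces a
five-byte string, and a plain strchr for '/' on it still lands on the
right byte, because no byte below 0x80 ever occurs inside a UTF-8
multibyte sequence. The catch is the one already mentioned: c0 80 is not
well-formed UTF-8, so it has to be undone before the string is handed to
libc or anything else outside the shell.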