From mboxrd@z Thu Jan 1 00:00:00 1970
Mailing-List: contact zsh-workers-help@sunsite.dk; run by ezmlm
X-Seq: 20455
Message-Id: <200410051132.i95BW1qv007200@news01.csr.com>
To: Zsh-workers
Subject: Re: UTF-8 support
In-reply-to:
 <29214.1096974092@trentino.logica.co.uk>
References: <20041001184122.GA9094@fargo> <23473.1096659965@trentino.logica.co.uk> <200410041620.i94GKNro006000@news01.csr.com> <29214.1096974092@trentino.logica.co.uk>
Date: Tue, 05 Oct 2004 12:32:01 +0100
From: Peter Stephenson

Oliver Kiddle wrote:
> If you want to find a short string in a long string you can surely
> metafy the short string instead of unmetafying the long string.

Both strings are likely to be metafied anyway, internally, but that
doesn't help if you're using the library routines for comparisons, since
they don't know about meta characters; and because you don't know where
a character ends, you also don't know at what byte two characters differ
without using library functions.

Unless you guess where it ends you need the entire string from the first
multibyte character in the representation used by the library. Indeed,
unless we start with some assumption about the encoding we have to
compare every single character with library functions on an unmetafied
string. This is very messy if we have to support systems where the
library functions aren't available (and we break quite a lot unless we
do that).

So, while I can't say for sure, I strongly suspect we're going to end up
with having to make some of the assumptions which are already encoded
into the library. Thus some kind of hybrid is forced on us for practical
reasons. Given this, I suspect that assuming UTF-8 and avoiding the
library functions where we don't need them is actually going to be the
neatest. However, this remains to be seen.

I can't see an advantage in assuming UTF-8 and then relying on the
library for comparisons etc. This seems to give the worst of both
worlds.
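To make the "assume UTF-8 and avoid the library" option concrete, here is a minimal sketch in C. It is not code from the zsh source: the function names utf8_seq_len and utf8_first_diff are hypothetical, and the sketch assumes both strings are valid UTF-8, NUL-terminated, and already unmetafied. It shows that, under the UTF-8 assumption, a character's length follows from its lead byte alone, and that the character at which two strings differ can be found by a plain byte comparison followed by backing up over continuation bytes.

```c
#include <stddef.h>

/*
 * Hypothetical helper: length in bytes of the UTF-8 sequence that starts
 * with lead byte c, or 0 if c cannot start a sequence (a continuation
 * byte or a byte that never appears in valid UTF-8).
 */
static size_t utf8_seq_len(unsigned char c)
{
    if (c < 0x80)
        return 1;               /* ASCII */
    if (c >= 0xC2 && c <= 0xDF)
        return 2;
    if (c >= 0xE0 && c <= 0xEF)
        return 3;
    if (c >= 0xF0 && c <= 0xF4)
        return 4;
    return 0;
}

/*
 * Hypothetical helper: byte offset of the start of the first character
 * at which a and b differ, or -1 if the strings are identical.
 * Assumes valid, NUL-terminated, unmetafied UTF-8 in both arguments.
 */
static long utf8_first_diff(const char *a, const char *b)
{
    size_t i = 0;

    /* Plain byte comparison; no library call is needed for this part. */
    while (a[i] != '\0' && a[i] == b[i])
        i++;
    if (a[i] == b[i])
        return -1;              /* both strings ended together */

    /*
     * The differing byte may be in the middle of a character. Since the
     * prefixes are identical, both strings share character boundaries
     * up to here, so back up over continuation bytes (10xxxxxx).
     */
    while (i > 0 && ((unsigned char)a[i] & 0xC0) == 0x80)
        i--;
    return (long)i;
}
```

For example, comparing "caf\xc3\xa9" with "caf\xc3\xa8" finds the first differing byte at offset 4, then backs up to offset 3, where the two-byte character starts. Without the UTF-8 assumption, that final step is the one that would need mblen or a similar library routine.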
> The approach I was suggesting has the big advantage that we can add
> support in isolated areas without first breaking the entire shell.

That can be done however we decide, at least if we keep the current Meta
scheme. Indeed, that's probably the way to go; we can experiment with
different methods locally before altering the rest of the shell. The
pattern code is probably the most time-critical for comparing multibyte
characters. Maybe this is a good time to look at removing the
requirement for NULL-terminated strings after all.

> mblen may be easy to reimplement but wcwidth is not so we'd end up
> with a mixture.

Yes, we certainly need library calls in zle. However, formatting strings
for interactive output doesn't need to go particularly fast. As I said,
I think that in practice we're stuck with a mixture anyway.

pws