From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <zsh-workers-return-20455-mason-zsh=primenet.com.au@sunsite.dk>
Received: (qmail 10306 invoked from network); 5 Oct 2004 11:33:32 -0000
Received: from news.dotsrc.org (HELO a.mx.sunsite.dk) (130.225.247.88)
  by ns1.primenet.com.au with SMTP; 5 Oct 2004 11:33:32 -0000
Received: (qmail 44826 invoked from network); 5 Oct 2004 11:33:26 -0000
Received: from sunsite.dk (130.225.247.90)
  by a.mx.sunsite.dk with SMTP; 5 Oct 2004 11:33:26 -0000
Received: (qmail 18767 invoked by alias); 5 Oct 2004 11:33:12 -0000
Mailing-List: contact zsh-workers-help@sunsite.dk; run by ezmlm
Precedence: bulk
X-No-Archive: yes
X-Seq: 20455
Received: (qmail 18751 invoked from network); 5 Oct 2004 11:33:11 -0000
Received: from unknown (HELO a.mx.sunsite.dk) (130.225.247.88)
  by sunsite.dk with SMTP; 5 Oct 2004 11:33:11 -0000
Received: (qmail 44175 invoked from network); 5 Oct 2004 11:32:13 -0000
Received: from lhuumrelay3.lnd.ops.eu.uu.net (62.189.58.19)
  by a.mx.sunsite.dk with SMTP; 5 Oct 2004 11:32:11 -0000
Received: from MAILSWEEPER01.csr.com (mailhost1.csr.com [62.189.183.235])
	by lhuumrelay3.lnd.ops.eu.uu.net (8.11.0/8.11.0) with ESMTP id i95BW7v10766
	for <zsh-workers@sunsite.dk>; Tue, 5 Oct 2004 11:32:08 GMT
Received: from EXCHANGE02.csr.com (unverified [192.168.137.45]) by MAILSWEEPER01.csr.com
 (Content Technologies SMTPRS 4.3.12) with ESMTP id <T6c757b0a35c0a88d013d0@MAILSWEEPER01.csr.com> for <zsh-workers@sunsite.dk>;
 Tue, 5 Oct 2004 12:31:05 +0100
Received: from news01.csr.com ([192.168.143.38]) by EXCHANGE02.csr.com with Microsoft SMTPSVC(5.0.2195.6713);
	 Tue, 5 Oct 2004 12:34:13 +0100
Received: from news01.csr.com (localhost.localdomain [127.0.0.1])
	by news01.csr.com (8.12.11/8.12.11) with ESMTP id i95BW3K2007203
	for <zsh-workers@sunsite.dk>; Tue, 5 Oct 2004 12:32:03 +0100
Received: from csr.com (pws@localhost)
	by news01.csr.com (8.12.11/8.12.11/Submit) with ESMTP id i95BW1qv007200
	for <zsh-workers@sunsite.dk>; Tue, 5 Oct 2004 12:32:03 +0100
Message-Id: <200410051132.i95BW1qv007200@news01.csr.com>
X-Authentication-Warning: news01.csr.com: pws owned process doing -bs
To: Zsh-workers <zsh-workers@sunsite.dk>
Subject: Re: UTF-8 support 
In-reply-to: <29214.1096974092@trentino.logica.co.uk> 
References: <20041001184122.GA9094@fargo> <23473.1096659965@trentino.logica.co.uk> <200410041620.i94GKNro006000@news01.csr.com> <29214.1096974092@trentino.logica.co.uk>
Date: Tue, 05 Oct 2004 12:32:01 +0100
From: Peter Stephenson <pws@csr.com>
X-OriginalArrivalTime: 05 Oct 2004 11:34:13.0837 (UTC) FILETIME=[3D679FD0:01C4AACF]
X-Spam-Checker-Version: SpamAssassin 2.63 on a.mx.sunsite.dk
X-Spam-Level: 
X-Spam-Status: No, hits=0.0 required=6.0 tests=none autolearn=no version=2.63
X-Spam-Hits: 0.0

Oliver Kiddle wrote:
> If you want to find a short string in a long string you can surely
> metafy the short string instead of unmetafying the long string.

Both strings are likely to be metafied anyway, internally, but that
doesn't help if you're using the library routines for comparisons, since
they don't know about Meta characters.  And because you don't know where
a character ends, you also don't know at which byte two characters
differ without using library functions.  Unless you guess where the
character ends, you need the entire string, from the first multibyte
character onwards, in the representation used by the library.
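
To make the point about representations concrete, here is a minimal
sketch of turning a metafied string back into the raw bytes a library
routine would expect.  It assumes the Meta marker byte is 0x83 and that
a metafied byte is stored as the original byte XOR 32; the name
sketch_unmetafy is hypothetical and is not the shell's own routine.

```c
#include <stddef.h>

/* Assumption: the Meta marker is byte 0x83, and the byte after it is
 * the original byte XOR 32. */
#define META_BYTE 0x83

/*
 * Hypothetical helper: copy the metafied, NUL-terminated string `src`
 * into `dst` as raw bytes and return how many raw bytes were written.
 * `dst` needs room for at least strlen(src) bytes.  The output is not
 * NUL-terminated, since the raw data may itself contain NUL bytes.
 */
static size_t sketch_unmetafy(const char *src, unsigned char *dst)
{
    const unsigned char *p = (const unsigned char *)src;
    size_t n = 0;

    while (*p) {
        if (*p == META_BYTE && p[1]) {
            /* Undo the XOR that was applied when the byte was metafied. */
            dst[n++] = (unsigned char)(p[1] ^ 32);
            p += 2;
        } else {
            dst[n++] = *p++;
        }
    }
    return n;
}
```

Only after a pass like this can the raw bytes be handed to something
such as mbrtowc(), which is why the whole tail of the string has to be
converted before the library can say where a character ends.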

Indeed, unless we start with some assumption about the encoding we have
to compare every single character with library functions on an
unmetafied string.  This is very messy if we have to support systems
where the library functions aren't available (and we break quite a lot
unless we do that).  So, while I can't say for sure, I strongly suspect
we're going to end up having to make some of the assumptions which
are already encoded into the library.  Thus some kind of hybrid is
forced on us for practical reasons.  Given this, I suspect that assuming
UTF-8 and avoiding the library functions where we don't need them is
actually going to be the neatest approach.  However, this remains to be
seen.
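
As an illustration of what assuming UTF-8 buys, the sketch below finds
the character at which two strings first differ using plain byte
comparisons and no library calls.  It relies on the UTF-8 property that
continuation bytes always have the form 10xxxxxx.  The function names
are hypothetical, both inputs are assumed to be valid, unmetafied UTF-8
of the same length, and no validation (for example of overlong forms)
is attempted.

```c
#include <stddef.h>

/*
 * Length in bytes of a UTF-8 sequence, judged from its lead byte alone.
 * Returns 0 for a continuation byte or a byte that cannot start a
 * sequence.
 */
static size_t utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80)
        return 1;               /* 0xxxxxxx: single byte (ASCII) */
    if ((lead & 0xE0) == 0xC0)
        return 2;               /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0)
        return 3;               /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0)
        return 4;               /* 11110xxx */
    return 0;
}

/*
 * Byte offset of the start of the first character at which `a` and `b`
 * differ, or `len` if the first `len` bytes are identical.
 */
static size_t utf8_first_diff(const unsigned char *a,
                              const unsigned char *b, size_t len)
{
    size_t i = 0;

    /* Find the first differing byte. */
    while (i < len && a[i] == b[i])
        i++;
    if (i == len)
        return len;

    /* Back up over continuation bytes to the start of the character. */
    while (i > 0 && (a[i] & 0xC0) == 0x80)
        i--;
    return i;
}
```

Because the two strings share every byte before the mismatch, they also
share character boundaries up to that point, so inspecting `a` alone is
enough to find the start of the differing character.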

I can't see an advantage in assuming UTF-8 and then relying on the
library for comparisons etc.  This seems to give the worst of both
worlds.

> The approach I was suggesting has the big advantage that we can add
> support in isolated areas without first breaking the entire shell.

That can be done however we decide, at least if we keep the current Meta
scheme.  Indeed, that's probably the way to go; we can experiment with
different methods locally before altering the rest of the shell.  The
pattern code is probably the most time-critical for comparing multibyte
characters.  Maybe this is a good time to look at removing the
requirement for NUL-terminated strings after all.

> mblen may be easy to reimplement but wcwidth is not so we'd end up
> with a mixture.

Yes, we certainly need library calls in zle.  However, formatting
strings for interactive output doesn't need to go particularly fast.
As I said, I think that in practice we're stuck with a mixture anyway.

pws

