From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <zsh-workers-return-20682-mason-zsh=primenet.com.au@sunsite.dk>
Received: (qmail 23675 invoked from network); 11 Jan 2005 15:30:27 -0000
Received: from news.dotsrc.org (HELO a.mx.sunsite.dk) (130.225.247.88)
  by ns1.primenet.com.au with SMTP; 11 Jan 2005 15:30:27 -0000
Received: (qmail 75694 invoked from network); 11 Jan 2005 15:30:21 -0000
Received: from sunsite.dk (130.225.247.90)
  by a.mx.sunsite.dk with SMTP; 11 Jan 2005 15:30:21 -0000
Received: (qmail 809 invoked by alias); 11 Jan 2005 15:30:17 -0000
Mailing-List: contact zsh-workers-help@sunsite.dk; run by ezmlm
Precedence: bulk
X-No-Archive: yes
X-Seq: 20682
Received: (qmail 792 invoked from network); 11 Jan 2005 15:30:16 -0000
Received: from news.dotsrc.org (HELO a.mx.sunsite.dk) (130.225.247.88)
  by sunsite.dk with SMTP; 11 Jan 2005 15:30:16 -0000
Received: (qmail 75378 invoked from network); 11 Jan 2005 15:30:16 -0000
Received: from mailhost1.csr.com (HELO MAILSWEEPER01.csr.com) (81.105.217.43)
  by a.mx.sunsite.dk with SMTP; 11 Jan 2005 15:30:13 -0000
Received: from exchange03.csr.com (unverified [10.100.137.60]) by MAILSWEEPER01.csr.com
 (Content Technologies SMTPRS 4.3.12) with ESMTP id <T6e6ecd0a630a6c8d012dc@MAILSWEEPER01.csr.com> for <zsh-workers@sunsite.dk>;
 Tue, 11 Jan 2005 15:28:53 +0000
Received: from news01.csr.com ([10.103.143.38]) by exchange03.csr.com with Microsoft SMTPSVC(5.0.2195.6713);
	 Tue, 11 Jan 2005 15:32:24 +0000
Received: from news01.csr.com (localhost.localdomain [127.0.0.1])
	by news01.csr.com (8.13.1/8.12.11) with ESMTP id j0BFUBD9014732
	for <zsh-workers@sunsite.dk>; Tue, 11 Jan 2005 15:30:12 GMT
Received: from csr.com (pws@localhost)
	by news01.csr.com (8.13.1/8.13.1/Submit) with ESMTP id j0BFUBjn014729
	for <zsh-workers@sunsite.dk>; Tue, 11 Jan 2005 15:30:11 GMT
Message-Id: <200501111530.j0BFUBjn014729@news01.csr.com>
X-Authentication-Warning: news01.csr.com: pws owned process doing -bs
To: zsh-workers@sunsite.dk (Zsh hackers list)
Subject: Re: Some groundwork for Unicode in Zle 
In-reply-to: <27571.1105455081@trentino.logica.co.uk> 
References: <200501111352.j0BDqCKs001801@news01.csr.com> <27571.1105455081@trentino.logica.co.uk>
Date: Tue, 11 Jan 2005 15:30:11 +0000
From: Peter Stephenson <pws@csr.com>
X-OriginalArrivalTime: 11 Jan 2005 15:32:24.0717 (UTC) FILETIME=[BFEA5BD0:01C4F7F2]
X-Spam-Checker-Version: SpamAssassin 2.63 on a.mx.sunsite.dk
X-Spam-Level: 
X-Spam-Status: No, hits=0.0 required=6.0 tests=none autolearn=no version=2.63
X-Spam-Hits: 0.0

> Peter wrote:
> > from how the line is encoded internally.  We can use wchar_t inside and
> > pass back a multibyte string.
> 
> Good to see this being addressed. How do you plan to cope with encoding
> nulls if you use wchar_t? (or does zle not bother?) The whole meta stuff
> is what really scared me off ever touching this.

Do you mean null-terminating a string?  I don't think we need that
inside ZLE, though it's easy to get into confusion with the conversions
needed for completion.  Or do you mean difficulties with L'\0'?

> > I've made a very dull patch that does a few things that might make adding
> > Unicode support to Zle easier.  Actually, I think within Zle it should
> > be easy to use generic wchar_t's and not worry about whether they're
> > really Unicode, but I still propose to rely on __STDC_ISO_10646__ to
> 
> Why? Relying on __STDC_ISO_10646__ will rule out a good number of
> systems that do otherwise have good support for multibyte encodings such
> as UTF-8. __STDC_ISO_10646__ is defined on surprisingly few systems. We
> really don't care about what wchar_t is internally if we let libc do our
> conversions.

This definition means we can use wchar_t in a way natural for Unicode
(yes, it doesn't matter if that really is Unicode or something else)
without worrying, and likewise there is compiler support.  For example,
it means L'\0' etc. works.
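
As a rough illustration of what the macro buys us (a sketch only, not part of the patch; ucs_to_wchar is a made-up name): when __STDC_ISO_10646__ is defined, a Unicode code point and the corresponding wchar_t have the same numeric value, so conversion is a range check and a cast. Without the macro the sketch simply refuses, which is where iconv or a hand-written conversion would have to come in.

```c
#include <wchar.h>

/*
 * Hypothetical helper: store code point `cp` in *out as a wchar_t.
 * Returns 1 on success, 0 if cp is out of range or if the C library
 * does not promise that wchar_t values are ISO 10646 code points.
 */
static int ucs_to_wchar(unsigned long cp, wchar_t *out)
{
    /* Reject values beyond U+10FFFF and the UTF-16 surrogate range. */
    if (cp > 0x10FFFFUL || (cp >= 0xD800UL && cp <= 0xDFFFUL))
        return 0;
#ifdef __STDC_ISO_10646__
    *out = (wchar_t)cp;   /* same numeric value, so a cast is enough */
    return 1;
#else
    (void)out;            /* no guarantee here: caller needs another route */
    return 0;
#endif
}
```

On a system that defines the macro, ucs_to_wchar(0x41, &w) leaves w equal to L'A'.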

I'm really not interested in fudging round that sort of thing at this
stage, nor random #ifdef fudges.  Anything I do is likely to be on
Linux.  If someone else wants to see what the effect of relaxing the test
is and how it can be fixed up, if necessary, I will be delighted.  For
me it's an unnecessary complication stopping me getting something going.

> > Before I get to details of what I've patched so far, one question: how
> > do we turn input into characters?  My first thought was to do it at a low
> > level around getkey, possibly in getkeybuf which already does
> 
> That would seem more sensible to me. Allowing partial multi-byte
> sequences to be bound is not very nice and probably not very useful.

I'm not sure I agree.  Suppose you have a Meta-sequence bound, i.e. a
binding for some ASCII character with the high bit set.  This conflicts
with a fully working UTF-8 input system, but you may well not have that,
just a system which happens to boot up with a UTF-8 locale.
Implementing only wchar_t based lookups will break this completely.
That seems pretty fatal to me.
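
To make the conflict concrete, here is a minimal sketch (the function names are invented for illustration and are not zsh code). A Meta binding is a single byte with the high bit set, while a UTF-8 decoder classifies that same byte as either the start of a longer sequence or as invalid, so a lookup that only ever sees complete wide characters never gets to match it.

```c
#include <assert.h>

/* Meta-<c>: the ASCII character c with the high bit set. */
static unsigned char meta_byte(unsigned char c)
{
    assert(c < 0x80);               /* only ASCII has a Meta form */
    return (unsigned char)(c | 0x80);
}

/*
 * Number of bytes in a UTF-8 sequence that starts with byte b, or 0 if
 * b cannot start a sequence (continuation byte or invalid lead byte).
 */
static int utf8_expected_len(unsigned char b)
{
    if (b < 0x80)
        return 1;                   /* plain ASCII */
    if (b >= 0xC2 && b <= 0xDF)
        return 2;
    if (b >= 0xE0 && b <= 0xEF)
        return 3;
    if (b >= 0xF0 && b <= 0xF4)
        return 4;
    return 0;                       /* 0x80-0xC1 and 0xF5-0xFF */
}
```

For example, Meta-a is the byte 0xE1, which a UTF-8 decoder takes as the first byte of a three-byte character and then waits for two more bytes that never arrive.  Meta-. is 0xAE, a bare continuation byte, which is not a valid start at all.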

> > The actual Unicode-related changes are minimal.  system.h shows how I
> 
> Did you mean to attach an actual patch?

No, it's huge with all the changes of names.

> > #if defined(HAVE_WCHAR_H) && defined(HAVE_WCTOMB) && defined (__STDC_ISO_10646__)
> > # include <wchar.h>
> 
> You'll probably want to include wchar.h even if __STDC_ISO_10646__ is
> not defined. For \u/\U, wchar_t was only useful when converting from
> unicode to wchar_t could be done trivially: when __STDC_ISO_10646__ is
> defined. It otherwise uses iconv or a hardcoded UTF-8 conversion. For
> zle, I can't think of any instance where you would care whether wchar_t
> is unicode.

You may well be right, but again it's just another unnecessary
complication at this stage and I don't have a system to test the
effect.

> > /*
> >  * More stringent requirements to enable complete Unicode conversion
> >  * between wide characters and multibyte strings.
> >  */
> > #if defined(HAVE_MBTOWC)
> > /*#define ZLE_UNICODE_SUPPORT	1*/
> 
> I don't quite follow the logic of that check.
> 
> I wouldn't have thought ZLE_UNICODE_SUPPORT is a good name for the
> define. The requirement is to support multibyte character encodings, not
> specifically "unicode" and the same define will probably be extended to
> areas outside of zle. How about ENABLE_MULTIBYTE, perhaps linked to a
> configure --disable-multibyte option.

The *requirement* is to support Unicode, in particular UTF-8.  The
*hope* is to be able to allow other schemes without much work.  The
whole point of Unicode is to replace other schemes anyway, and they are
unlikely ever to be well-tested even if they work.

I'm strongly of the opinion we should stick with multibyte strings
outside Zle, and continue to have the interface to ZLE pass back such a
string.  I think the difficulties and costs of extending the use of
wchar_t outside ZLE would be prohibitive for very little gain.  So
this definition does apply only to ZLE, and (although this is not
necessarily all it does) is specifically targeted at making Unicode
schemes work.
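
The boundary I have in mind looks roughly like the following sketch.  It is an illustration only: line_to_wide and line_to_multibyte are hypothetical names, it relies on the POSIX behaviour of mbstowcs() and wcstombs() returning the required length when passed a null destination, and it ignores embedded nulls and the shell's Meta encoding, both of which the real code would have to deal with.

```c
#include <stdlib.h>
#include <wchar.h>

/*
 * Hypothetical: convert the multibyte line handed to the editor into a
 * freshly allocated wide string.  Returns NULL on an invalid sequence
 * or allocation failure.  Conversion follows the current locale.
 */
static wchar_t *line_to_wide(const char *mb)
{
    size_t n = mbstowcs(NULL, mb, 0);   /* count wide characters only */
    wchar_t *w;

    if (n == (size_t)-1)
        return NULL;
    w = malloc((n + 1) * sizeof *w);
    if (w)
        mbstowcs(w, mb, n + 1);         /* n + 1 leaves room for L'\0' */
    return w;
}

/*
 * Hypothetical: convert the edited wide line back to a multibyte string
 * for the rest of the shell.  Returns NULL on failure.
 */
static char *line_to_multibyte(const wchar_t *w)
{
    size_t n = wcstombs(NULL, w, 0);    /* count bytes only */
    char *mb;

    if (n == (size_t)-1)
        return NULL;
    mb = malloc(n + 1);
    if (mb)
        wcstombs(mb, w, n + 1);
    return mb;
}
```

Everything between the two calls can work on wchar_t, and nothing outside them needs to know.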

> > typedef wchar_t *ZLE_STRING_T;
> > #else
> > typedef int ZLE_CHAR_T;
> 
> Why int and not unsigned char? Is it really worth having the separate
> STRING type? Again, I wouldn't use "ZLE" in the name given that we may
> want to use it outside zle someday.

This is uncontroversial; it's what we do at the moment.  Functions
returning a single character return an int and functions dealing with a
string use an unsigned char *.  The separate definitions wouldn't be
needed if we just had wchar_t arrays; it's backward compatibility
that makes them necessary.
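
Spelled out, the pair of definitions would look something like this sketch.  Only the two typedefs quoted above come from the patch; the wide ZLE_CHAR_T and the narrow ZLE_STRING_T are assumptions filled in from the description in the previous paragraph.

```c
#ifdef ZLE_UNICODE_SUPPORT
# include <wchar.h>
typedef wchar_t ZLE_CHAR_T;            /* assumption: not shown in the quote */
typedef wchar_t *ZLE_STRING_T;
#else
typedef int ZLE_CHAR_T;                /* single characters are passed as int */
typedef unsigned char *ZLE_STRING_T;   /* assumption: matches existing usage */
#endif
```

Code written against the two names then compiles unchanged in either configuration.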

As I said, I have no intention of ever using this scheme outside ZLE.

> > All the tests still pass, so I will commit this some time today.
> 
> Would it be worth creating a separate branch for multibyte support? It
> could later become 4.3. If so I'd suggest we continue to commit
> everything non-multibyte related to the current branch to avoid the old
> issue of the current release being very old.

(We'd probably do it the other way --- create a separate stable 4.2
branch so the new one was a mainline.)  It may become necessary, but
this will be just another way of slowing things down, so I'd like to
avoid it until things become too hairy.

It would be good at least to get 4.2.2 out of the way before anything
more than groundwork appears, though.

-- 
Peter Stephenson <pws@csr.com>                  Software Engineer
CSR PLC, Churchill House, Cambridge Business Park, Cowley Road
Cambridge, CB4 0WZ, UK                          Tel: +44 (0)1223 692070

