Re: Some groundwork for Unicode in Zle

From: Peter Stephenson <pws@csr.com>
To: zsh-workers@sunsite.dk (Zsh hackers list)
Subject: Re: Some groundwork for Unicode in Zle
Date: Tue, 11 Jan 2005 15:30:11 +0000
Message-ID: <200501111530.j0BFUBjn014729@news01.csr.com>
In-Reply-To: <27571.1105455081@trentino.logica.co.uk>

> Peter wrote:
> > from how the line is encoded internally.  We can use wchar_t inside and
> > pass back a multibyte string.
> 
> Good to see this being addressed. How do you plan to cope with encoding
> nulls if you use wchar_t? (or does zle not bother?) The whole meta stuff
> is what really scared me off ever touching this.

Do you mean null-terminating a string?  I don't think we need that
inside ZLE, though it's easy to get into confusion with the conversions
needed for completion.  Or do you mean difficulties with L'\0'?

> > I've made a very dull patch that does a few things that might make adding
> > Unicode support to Zle easier.  Actually, I think within Zle it should
> > be easy to use generic wchar_t's and not worry about whether they're
> > really Unicode, but I still propose to rely on __STDC_ISO_10646__ to
> 
> Why? Relying on __STDC_ISO_10646__ will rule out a good number of
> systems that do otherwise have good support for multibyte encodings such
> as UTF-8. __STDC_ISO_10646__ is defined on surprisingly few systems. We
> really don't care about what wchar_t is internally if we let libc do our
> conversions.

This definition means we can use wchar_t in a way natural for Unicode
(yes, it doesn't matter if that really is Unicode or something else)
without worrying, and likewise there is compiler support.  For example,
it means L'\0' etc. works.
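
A minimal sketch of what the macro buys, not taken from the patch; both function names are hypothetical.  When __STDC_ISO_10646__ is defined, a wchar_t value can be compared directly against a Unicode code point:

```c
#include <wchar.h>

/* Report whether the implementation promises that wchar_t values
 * are ISO 10646 (Unicode) code points. */
int wchar_is_iso10646(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}

/* Compare a wide character against U+00E9 by number.  This is only
 * meaningful as "is this e-acute?" when wchar_is_iso10646() is 1;
 * on other systems the numeric value may mean something else. */
int is_e_acute(wchar_t wc)
{
    return wc == (wchar_t)0x00E9;
}
```

Code that only passes wide characters through libc conversions would not need the first function at all, which is the objection raised in the quoted text.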

I'm really not interested in fudging round that sort of thing at this
stage, nor random #ifdef fudges.  Anything I do is likely to be on
Linux.  If someone else wants to see what the effect of relaxing the test
is and how it can be fixed up, if necessary, I will be delighted.  For
me it's an unnecessary complication stopping me getting something going.

> > Before I get to details of what I've patched so far, one question: how
> > do we turn input into characters?  My first thought was to do it at a low
> > level around getkey, possibly in getkeybuf which already does
> 
> That would seem more sensible to me. Allowing partial multi-byte
> sequences to be bound is not very nice and probably not very useful.

I'm not sure I agree.  Suppose you have a Meta-sequence bound, i.e. a
binding for some ASCII character with the high bit set.  This conflicts
with a fully working UTF-8 input system, but you may well not have that,
just a system which happens to boot up with a UTF-8 locale.
Implementing only wchar_t based lookups will break this completely.
That seems pretty fatal to me.
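
To illustrate the conflict, here is a small sketch (hypothetical helper names, plain byte arithmetic rather than anything from Zle).  A Meta binding is an ASCII character with bit 7 set, and the resulting byte can look like the start of a UTF-8 sequence or like a byte that is never valid on its own:

```c
/* Length of the UTF-8 sequence that byte b would start:
 * 1 for ASCII, 2 to 4 for a valid lead byte, and 0 if b is a
 * continuation byte or can never begin a sequence. */
int utf8_seq_len(unsigned char b)
{
    if (b < 0x80)
        return 1;
    if (b >= 0xC2 && b <= 0xDF)
        return 2;
    if (b >= 0xE0 && b <= 0xEF)
        return 3;
    if (b >= 0xF0 && b <= 0xF4)
        return 4;
    return 0;
}

/* The byte produced by the Meta convention: ASCII with the high bit set. */
unsigned char meta_byte(char c)
{
    return (unsigned char)((unsigned char)c | 0x80);
}
```

Meta-a is 0xE1, which a UTF-8 decoder takes as the first byte of a three-byte character, while Meta-! is 0xA1, a bare continuation byte.  Neither can be looked up as a complete wide character, so a byte-level lookup is still needed for such bindings.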

> > The actual Unicode-related changes are minimal.  system.h shows how I
> 
> Did you mean to attach an actual patch?

No, it's huge with all the changes of names.

> > #if defined(HAVE_WCHAR_H) && defined(HAVE_WCTOMB) && defined (__STDC_ISO_10646__)
> > # include <wchar.h>
> 
> You'll probably want to include wchar.h even if __STDC_ISO_10646__ is
> not defined. For \u/\U, wchar_t was only useful when converting from
> unicode to wchar_t could be done trivially: when __STDC_ISO_10646__ is
> defined. It otherwise uses iconv or a hardcoded UTF-8 conversion. For
> zle, I can't think of any instance where you would care whether wchar_t
> is unicode.

You may well be right, but again it's just another unnecessary
complication at this stage and I don't have a system to test the
effect.

> > /*
> >  * More stringent requirements to enable complete Unicode conversion
> >  * between wide characters and multibyte strings.
> >  */
> > #if defined(HAVE_MBTOWC)
> > /*#define ZLE_UNICODE_SUPPORT	1*/
> 
> I don't quite follow the logic of that check.
> 
> I wouldn't have thought ZLE_UNICODE_SUPPORT is a good name for the
> define. The requirement is to support multibyte character encodings, not
> specifically "unicode" and the same define will probably be extended to
> areas outside of zle. How about ENABLE_MULTIBYTE, perhaps linked to a
> configure --disable-multibyte option.

The *requirement* is to support Unicode, in particular UTF-8.  The
*hope* is to be able to allow other schemes without much work.  The
whole point of Unicode is to replace other schemes anyway, and they are
unlikely ever to be well-tested even if they work.

I'm strongly of the opinion we should stick with multibyte strings
outside Zle, and continue to have the interface to ZLE pass back such a
string.  I think the difficulties and costs of extending the use of
wchar_t outside ZLE would be prohibitive for very little gain.  So
this definition does apply only to ZLE, and (although this is not
necessarily all it does) is specifically targeted at making Unicode
schemes work.
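
A rough sketch of that boundary, assuming the line is held as a wchar_t array with an explicit length; the function name is hypothetical and this is not the Zle code.  Because the length is carried separately, an embedded L'\0' is converted like any other character:

```c
#include <limits.h>
#include <string.h>
#include <wchar.h>

/* Convert len wide characters (which may include L'\0') to a
 * multibyte string in the current locale.  Returns the number of
 * bytes written to out, or (size_t)-1 if a character cannot be
 * converted or out is too small.  The output is not terminated. */
size_t wide_line_to_mb(const wchar_t *line, size_t len,
                       char *out, size_t outsize)
{
    mbstate_t st;
    char tmp[MB_LEN_MAX];
    size_t used = 0, i;

    memset(&st, 0, sizeof st);          /* initial conversion state */
    for (i = 0; i < len; i++) {
        size_t n = wcrtomb(tmp, line[i], &st);
        if (n == (size_t)-1 || n > outsize - used)
            return (size_t)-1;
        memcpy(out + used, tmp, n);
        used += n;
    }
    return used;
}
```

The caller gets back a byte count rather than relying on a terminator, so the rest of the shell can keep treating the result as an ordinary multibyte string.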

> > typedef wchar_t *ZLE_STRING_T;
> > #else
> > typedef int ZLE_CHAR_T;
> 
> Why int and not unsigned char? Is it really worth having the separate
> STRING type? Again, I wouldn't use "ZLE" in the name given that we may
> want to use it outside zle someday.

This is uncontroversial; it's what we do at the moment.  Functions
returning a single character return an int and functions dealing with a
string use an unsigned char *.  The separate definitions wouldn't be
necessary if we just had wchar_t arrays; it's the backward compatibility
that makes them necessary.
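
For illustration, the pair of definitions might look like the sketch below.  Only the two typedefs quoted above are from the patch; the wchar_t form of ZLE_CHAR_T and the unsigned char * form of ZLE_STRING_T are inferred from the description, and zle_count_char is a hypothetical helper showing code written once against both variants:

```c
#include <stddef.h>

#ifdef ZLE_UNICODE_SUPPORT
# include <wchar.h>
typedef wchar_t ZLE_CHAR_T;             /* inferred */
typedef wchar_t *ZLE_STRING_T;          /* as quoted */
#else
typedef int ZLE_CHAR_T;                 /* as quoted */
typedef unsigned char *ZLE_STRING_T;    /* inferred from the text */
#endif

/* Count how often ch occurs in the first len elements of s.
 * The length is explicit, so no terminator is assumed. */
size_t zle_count_char(ZLE_STRING_T s, size_t len, ZLE_CHAR_T ch)
{
    size_t i, n = 0;

    for (i = 0; i < len; i++)
        if ((ZLE_CHAR_T)s[i] == ch)
            n++;
    return n;
}
```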

As I said, I have no intention of ever using this scheme outside ZLE.

> > All the tests still pass, so I will commit this some time today.
> 
> Would it be worth creating a separate branch for multibyte support? It
> could later become 4.3. If so I'd suggest we continue to commit
> everything non-multibyte related to the current branch to avoid the old
> issue of the current release being very old.

(We'd probably do it the other way --- create a separate stable 4.2
branch so the new one was a mainline.)  It may become necessary, but
this will be just another way of slowing things down, so I'd like to
avoid it until things become too hairy.

It would be good at least to get 4.2.2 out of the way before anything
more than groundwork appears, though.

-- 
Peter Stephenson <pws@csr.com>                  Software Engineer
CSR PLC, Churchill House, Cambridge Business Park, Cowley Road
Cambridge, CB4 0WZ, UK                          Tel: +44 (0)1223 692070
