Re: Some groundwork for Unicode in Zle

* Re: Some groundwork for Unicode in Zle
@ 2005-01-14 15:54 François-Xavier Coudert
  2005-01-14 16:06 ` Clint Adams
  2005-01-14 16:49 ` Peter Stephenson
  0 siblings, 2 replies; 19+ messages in thread
From: François-Xavier Coudert @ 2005-01-14 15:54 UTC (permalink / raw)
  To: zsh-workers

Hi all,

I'm new to the list but I'm interested in UTF-8 inclusion into Zle. My
question is the following: have you considered continuing to store
strings, such as the line being edited, in arrays of char (not wide
chars), while using a few functions to handle the fact that one Unicode
character may be represented by a few chars (and one glyph by a few
Unicode characters, though I'm not sure how that can be handled)?

Using a few of the functions glib exports for Unicode (but zsh could use
home-made functions if need be), I hacked (and it's nowhere close to
pretty) some internals of Zle in the following way:

diff -r zsh-4.2.3/Src/Zle/zle_misc.c zsh-fx/Src/Zle/zle_misc.c
29a30
> #include <glib.h>
97,98c98,99
<       cs += zmult;
<       backdel(zmult);
---
>       cs = (char *) (g_utf8_next_char (line + cs)) - (char *)line;
>       backdel(((char *) line + cs) - (char *)g_utf8_prev_char (line +
>       cs));
114a116,119
>     if (zmult > cs)
>       backdel (cs);
>     else
>       backdel(((char *) line + cs) - (char *)g_utf8_prev_char (line +
>       cs) - 1);

diff -r zsh-4.2.3/Src/Zle/zle_move.c zsh-fx/Src/Zle/zle_move.c
29c29
< 
---
> #include "glib.h"
162c162,167
<     cs += zmult;
---
>     cs = (char *) (g_utf8_next_char (line + cs)) - (char *)line;
174c179
<     cs -= zmult;
---
>     cs = (char *) (g_utf8_prev_char (line + cs)) - (char *)line;

diff -r zsh-4.2.3/Src/Zle/zle_utils.c zsh-fx/Src/Zle/zle_utils.c
29a30
> #include <glib.h>
94a96,97
>     int next, i;
>     
101,102c104,107
<       line[to] = line[to + cnt];
<       to++;
---
>         next = (char *) (g_utf8_next_char (line + cnt)) - (char *)line
>         - cnt;
>       for (i = to; i < to + next; i++)
>         line[i] = line[i + cnt];
>       to += next;

With this, one can correctly move around and delete (forward and
backward) Unicode characters with ease. Such modifications seem easy to
generalize. So the points I'd like to get your thoughts on are:

  1. is such an approach useful?
  2. what are the arguments against it? (it may need a wider rewrite of
some builtins than other approaches)
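
As a rough illustration of the approach (not the patch itself, and not
zsh code), cursor movement over a UTF-8 char array can be sketched
without glib. The helper names utf8_next_offset and utf8_prev_offset are
made up for this example, and the sketch assumes the buffer holds valid
UTF-8; it does nothing about combining characters.

```c
#include <assert.h>
#include <stddef.h>

/* True if b is a UTF-8 continuation byte (bit pattern 10xxxxxx). */
static int is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/*
 * Byte offset of the character following the one that starts at cs.
 * Hypothetical helper; the result is clamped to len.
 */
size_t utf8_next_offset(const unsigned char *line, size_t len, size_t cs)
{
    assert(line != NULL && cs <= len);
    if (cs < len)
        cs++;
    /* Skip the remaining bytes of the current character. */
    while (cs < len && is_continuation(line[cs]))
        cs++;
    return cs;
}

/*
 * Byte offset of the character preceding offset cs.
 * Hypothetical helper; the result is clamped to 0.
 */
size_t utf8_prev_offset(const unsigned char *line, size_t len, size_t cs)
{
    assert(line != NULL && cs <= len);
    if (cs > 0)
        cs--;
    /* Walk back until we reach a lead byte. */
    while (cs > 0 && is_continuation(line[cs]))
        cs--;
    return cs;
}
```

With the four-byte buffer "a\xc3\xa9z" (a, e-acute, z), moving forward
from offset 1 lands on offset 3, and moving back from 3 returns to 1.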

Thanks for your attention, and I hope I will be able to help make zsh
much more viable on UTF-8 systems!

FX

* Re: Some groundwork for Unicode in Zle
  2005-01-14 15:54 Some groundwork for Unicode in Zle François-Xavier Coudert
@ 2005-01-14 16:06 ` Clint Adams
  2005-01-14 16:20   ` François-Xavier Coudert
  2005-01-14 16:49 ` Peter Stephenson
  1 sibling, 1 reply; 19+ messages in thread
From: Clint Adams @ 2005-01-14 16:06 UTC (permalink / raw)
  To: François-Xavier Coudert; +Cc: zsh-workers

> I'm new to the list but I'm interested in UTF-8 inclusion into Zle. My
> question is the following: have you considered continuing to store
> strings, such as the line being edited, in arrays of char (not wide
> chars), while using a few functions to handle the fact that one Unicode
> character may be represented by a few chars (and one glyph by a few
> Unicode characters, though I'm not sure how that can be handled)?

See http://www.zsh.org/mla/workers/2001/msg02753.html for some
discussion about this.


* Re: Some groundwork for Unicode in Zle
  2005-01-14 16:06 ` Clint Adams
@ 2005-01-14 16:20   ` François-Xavier Coudert
  0 siblings, 0 replies; 19+ messages in thread
From: François-Xavier Coudert @ 2005-01-14 16:20 UTC (permalink / raw)
  To: Clint Adams; +Cc: zsh-workers

> See http://www.zsh.org/mla/workers/2001/msg02753.html for some
> discussion about this.

Thanks for the pointer. It seems things have evolved since then.
Nowadays, UTF-8 is becoming the main locale for systems, and that is
causing pain when you use zsh (I know it from personal experience, and
I get lots of complaints from students who don't understand what's
happening on their command line). Furthermore, I think replacing all
those "cs++" with calls to a proper function is not a quick hack: it's
something that is correct in the long term (we can make the function do
whatever we want, from a simple "++" to more sophisticated things).

There is one argument, though, that I don't understand: why couldn't the
ZLE refresh code (which is quite a hack in itself!) handle passing
multiple bytes at a time? That's clearly outside my competence, so I'll
happily trust the gurus on this, but nevertheless I'd like to understand
why the refresh code couldn't be modified to pass groups of bytes
instead of single bytes.

-- 
FX

* Re: Some groundwork for Unicode in Zle
  2005-01-14 15:54 Some groundwork for Unicode in Zle François-Xavier Coudert
  2005-01-14 16:06 ` Clint Adams
@ 2005-01-14 16:49 ` Peter Stephenson
  1 sibling, 0 replies; 19+ messages in thread
From: Peter Stephenson @ 2005-01-14 16:49 UTC (permalink / raw)
  To: zsh-workers

François-Xavier Coudert wrote:
> Hi all,
> 
> I'm new to the list but I'm interested in UTF-8 inclusion into Zle. My
> question is the following: have you considered continuing to store
> strings, such as the line being edited, in arrays of char (not wide
> chars), while using a few functions to handle the fact that one Unicode
> character may be represented by a few chars (and one glyph by a few
> Unicode characters, though I'm not sure how that can be handled)?

This works well inside the main shell --- we use something similar with
a "Meta" character to quote characters special to the shell.  This will
probably continue.

In the line editor, being able to access a character by index is fairly
fundamental.  Having to count every time we need to access a location
would be a big change.  It's better to have an array of the right type.
In other places we will need to pass a wchar_t or even a wint_t, but
this is better than having a multibyte string for every character.
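
To make the cost difference concrete, here is a minimal sketch (names
invented for illustration, valid UTF-8 assumed): finding where the Nth
character starts in a multibyte buffer means scanning from the start,
while a wchar_t array can simply be indexed.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/*
 * O(n): walk the UTF-8 buffer, counting characters, to find the byte
 * offset at which character number idx starts. Returns len if the
 * buffer holds fewer than idx characters. Hypothetical helper.
 */
size_t utf8_offset_of_char(const unsigned char *s, size_t len, size_t idx)
{
    size_t off = 0;

    assert(s != NULL);
    while (off < len && idx > 0) {
        off++;
        /* Skip continuation bytes (10xxxxxx) of the current character. */
        while (off < len && (s[off] & 0xC0) == 0x80)
            off++;
        idx--;
    }
    return off;
}

/* O(1): with an array of wide characters the index is the position. */
wchar_t wide_char_at(const wchar_t *line, size_t idx)
{
    assert(line != NULL);
    return line[idx];
}
```

Every cursor motion, deletion or redisplay that works by index would
pay the scanning cost in the first form, which is the "count every
time" change described above.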

-- 
Peter Stephenson <pws@csr.com>                  Software Engineer
CSR PLC, Churchill House, Cambridge Business Park, Cowley Road
Cambridge, CB4 0WZ, UK                          Tel: +44 (0)1223 692070

**********************************************************************
This email and any files transmitted with it are confidential and
intended solely for the use of the individual or entity to whom they
are addressed. If you have received this email in error please notify
the system manager.

This footnote also confirms that this email message has been swept by
MIMEsweeper for the presence of computer viruses.

www.mimesweeper.com
**********************************************************************

* Re: Some groundwork for Unicode in Zle
  2005-01-15 17:35     ` Clint Adams
@ 2005-01-15 19:28       ` Peter Stephenson
  0 siblings, 0 replies; 19+ messages in thread
From: Peter Stephenson @ 2005-01-15 19:28 UTC (permalink / raw)
  To: Zsh hackers list

Clint Adams wrote:
> > Trapping and converting all direct or indirect uses of zlecs (cursor
> > position) and zlell (line length) outside the core zle code to use the
> > correct position with a multibyte string is likely to be a nightmare.
> 
> I wonder if moving the zlecs-/zlell-using functions from Src/hist.c and
> Src/lex.c to Src/Zle would be a good idea, since they seem to be
> operating directly on the input buffer.

That's a long term goal.  I'd like to have separate variables within the
main shell (and probably the completion system, though that's murkier)
so that zlecs and zlell are only ever used when referring directly to
zleline.  The idea is that when converting the line to a multibyte
string we also convert the character positions to refer to the position
in that rather than the original wide character array.
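
A minimal sketch of that conversion, under stated assumptions: the
function name and signature below are invented for this example and are
not zsh's; cs and ll merely mirror the roles of zlecs and zlell. It
uses the standard wcrtomb and records the byte offset reached when the
loop arrives at the cursor's character index.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/*
 * Convert the first ll wide characters of wline into a newly allocated,
 * NUL-terminated multibyte string in the current locale's encoding.
 * cs is the cursor position as a wide-character index; on success
 * *mb_cs receives the matching byte offset in the result.
 * Returns NULL on allocation failure or an unconvertible character.
 */
char *wide_line_to_multibyte(const wchar_t *wline, size_t ll,
                             size_t cs, size_t *mb_cs)
{
    mbstate_t st;
    char *out, *p;
    size_t i;

    assert(wline != NULL && mb_cs != NULL && cs <= ll);
    memset(&st, 0, sizeof st);
    out = malloc(ll * MB_LEN_MAX + 1);
    if (out == NULL)
        return NULL;
    p = out;
    for (i = 0; i < ll; i++) {
        size_t n;

        if (i == cs)
            *mb_cs = (size_t)(p - out);   /* cursor sits before char i */
        n = wcrtomb(p, wline[i], &st);
        if (n == (size_t)-1) {            /* not representable */
            free(out);
            return NULL;
        }
        p += n;
    }
    if (cs == ll)
        *mb_cs = (size_t)(p - out);       /* cursor at end of line */
    *p = '\0';
    return out;
}
```

For pure ASCII the byte offset equals the index; in a UTF-8 locale
(selected by the caller with setlocale) the two diverge as soon as a
multibyte character precedes the cursor, which is why every outside use
of the positions has to be converted as well.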

For now, the problem is that if we move them into zle, we need hooks to
get at them from the main shell when appropriate.  That's not worth
doing if they're going to disappear from the main shell eventually.

I had a brief go at replacing zlecs and zlell when used for completion
and it blew up.  I think the difficulties were actually in the
completion functions that set up and use the lexical analyser, however.
It wasn't clear when to convert and when to restore.  Any help there
would be appreciated.

-- 
Peter Stephenson <pws@pwstephenson.fsnet.co.uk>
Work: pws@csr.com
Web: http://www.pwstephenson.fsnet.co.uk

* Re: Some groundwork for Unicode in Zle
  2005-01-14 13:10   ` Peter Stephenson
@ 2005-01-15 17:35     ` Clint Adams
  2005-01-15 19:28       ` Peter Stephenson
  0 siblings, 1 reply; 19+ messages in thread
From: Clint Adams @ 2005-01-15 17:35 UTC (permalink / raw)
  To: Peter Stephenson; +Cc: Zsh hackers list

> Trapping and converting all direct or indirect uses of zlecs (cursor
> position) and zlell (line length) outside the core zle code to use the
> correct position with a multibyte string is likely to be a nightmare.

I wonder if moving the zlecs-/zlell-using functions from Src/hist.c and
Src/lex.c to Src/Zle would be a good idea, since they seem to be
operating directly on the input buffer.


* Re: Some groundwork for Unicode in Zle
  2005-01-11 13:59 ` Peter Stephenson
  2005-01-11 15:11   ` DervishD
@ 2005-01-14 13:10   ` Peter Stephenson
  2005-01-15 17:35     ` Clint Adams
  1 sibling, 1 reply; 19+ messages in thread
From: Peter Stephenson @ 2005-01-14 13:10 UTC (permalink / raw)
  To: Zsh hackers list

I have committed the patch to lay some groundwork for Unicode, and
updated the version in the archive to 4.2.3-dev-1.

To recap, I'm hoping it will be possible to change the zle line
variable, now zleline, to use wchar_t, while keeping a multibyte string
in the main shell, and probably completion for now (since completion
interacts with the main shell).  Any relaxing of assumptions about the
environment can happen once this is basically working.

It should now be possible to locate all uses of the key zle variables
zleline, zlecs, zlell by name.  Earlier versions of the patch
incorrectly renamed other uses of ll and line.  I think I've caught all
these, but the completion code used a large number of internal variables
with the same names.

Trapping and converting all direct or indirect uses of zlecs (cursor
position) and zlell (line length) outside the core zle code to use the
correct position with a multibyte string is likely to be a nightmare.

-- 
Peter Stephenson <pws@csr.com>                  Software Engineer
CSR PLC, Churchill House, Cambridge Business Park, Cowley Road
Cambridge, CB4 0WZ, UK                          Tel: +44 (0)1223 692070

* Re: Some groundwork for Unicode in Zle
  2005-01-11 16:27 ` Bart Schaefer
  2005-01-11 16:35   ` Vin Shelton
@ 2005-01-11 16:49   ` DervishD
  1 sibling, 0 replies; 19+ messages in thread
From: DervishD @ 2005-01-11 16:49 UTC (permalink / raw)
  To: Bart Schaefer; +Cc: Zsh hackers list

    Hi Bart :)

 * Bart Schaefer <schaefer@brasslantern.com> dixit:
> On Jan 11,  1:52pm, Peter Stephenson wrote:
> } All the tests still pass, so I will commit this some time today.
> If you haven't committed already, might I suggest doing a 4.2.2 release
> first?  It's been 4 months and a lot of patches since 4.2.1, and this
> is a potentially very lengthy effort to embark upon.

    Yespleaseyespleaseyesplease... I want to upgrade my 4.0.9, and
upgrading to 4.2.1 when there are already enough patches in to justify
a 4.2.2 release seems pointless...

    Thanks for the suggestion, Bart :)

    Raúl Núñez de Arenas Coronado

-- 
Linux Registered User 88736
http://www.dervishd.net & http://www.pleyades.net/
It's my PC and I'll cry if I want to...


* Re: Some groundwork for Unicode in Zle
  2005-01-11 16:27 ` Bart Schaefer
@ 2005-01-11 16:35   ` Vin Shelton
  2005-01-11 16:49   ` DervishD
  1 sibling, 0 replies; 19+ messages in thread
From: Vin Shelton @ 2005-01-11 16:35 UTC (permalink / raw)
  To: Bart Schaefer; +Cc: Zsh hackers list

Bart Schaefer <schaefer@brasslantern.com> writes:

> If you haven't committed already, might I suggest doing a 4.2.2 release
> first?  It's been 4 months and a lot of patches since 4.2.1, and this
> is a potentially very lengthy effort to embark upon.
>

Bart,

I think that's a very sensible suggestion.

FWIW, just $.02 from a lurker.

  - vin


* Re: Some groundwork for Unicode in Zle
  2005-01-11 15:30   ` Peter Stephenson
@ 2005-01-11 16:31     ` Bart Schaefer
  0 siblings, 0 replies; 19+ messages in thread
From: Bart Schaefer @ 2005-01-11 16:31 UTC (permalink / raw)
  To: Zsh hackers list

On Jan 11,  3:30pm, Peter Stephenson wrote:
} Subject: Re: Some groundwork for Unicode in Zle
}
} It would be good at least to get 4.2.2 out of the way before anything
} more than groundwork appears, though.

Oops, that's what I get for not wading through the entire thread before
responding to the first message.

* Re: Some groundwork for Unicode in Zle
  2005-01-11 13:52 Peter Stephenson
                   ` (2 preceding siblings ...)
  2005-01-11 15:09 ` DervishD
@ 2005-01-11 16:27 ` Bart Schaefer
  2005-01-11 16:35   ` Vin Shelton
  2005-01-11 16:49   ` DervishD
  3 siblings, 2 replies; 19+ messages in thread
From: Bart Schaefer @ 2005-01-11 16:27 UTC (permalink / raw)
  To: Zsh hackers list

On Jan 11,  1:52pm, Peter Stephenson wrote:
}
} It seems clear that the line editor is the place most people are
} missing Unicode support, so I suggest we start from there and work
} back.

Just as a remark, if you're going to alter the line editor then you
probably have no choice but to alter the "read" builtin as well, at
least "read -k" because it's used in some user-defined widgets (and this
could get a bit ugly, as how does "read" know whether it's reading a
wide char stream or a simple byte stream?).

Keep in mind that the completion system DOES make use of nul-terminated
strings -- cf. the "bug in insert-last-word" thread, e.g. 20643.

} All the tests still pass, so I will commit this some time today.

If you haven't committed already, might I suggest doing a 4.2.2 release
first?  It's been 4 months and a lot of patches since 4.2.1, and this
is a potentially very lengthy effort to embark upon.

* Re: Some groundwork for Unicode in Zle
  2005-01-11 14:51 ` Oliver Kiddle
  2005-01-11 14:54   ` Mads Martin Joergensen
@ 2005-01-11 15:30   ` Peter Stephenson
  2005-01-11 16:31     ` Bart Schaefer
  1 sibling, 1 reply; 19+ messages in thread
From: Peter Stephenson @ 2005-01-11 15:30 UTC (permalink / raw)
  To: Zsh hackers list

> Peter wrote:
> > from how the line is encoded internally.  We can use wchar_t inside and
> > pass back a multibyte string.
> 
> Good to see this being addressed. How do you plan to cope with encoding
> nulls if you use wchar_t? (or does zle not bother?) The whole meta stuff
> is what really scared me off ever touching this.

Do you mean null-terminating a string?  I don't think we need that
inside ZLE, though it's easy to get into confusion with the conversions
needed for completion.  Or do you mean difficulties with L'\0'?

> > I've made a very dull patch that does a few things that might make adding
> > Unicode support to Zle easier.  Actually, I think within Zle it should
> > be easy to use generic wchar_t's and not worry about whether they're
> > really Unicode, but I still propose to rely on __STDC_ISO_10646__ to
> 
> Why? Relying on __STDC_ISO_10646__ will rule out a good number of
> systems that do otherwise have good support for multibyte encodings such
> as UTF-8. __STDC_ISO_10646__ is defined on surprisingly few systems. We
> really don't care about what wchar_t is internally if we let libc do our
> conversions.

This definition means we can use wchar_t in a way natural for Unicode
(yes, it doesn't matter if that really is Unicode or something else)
without worrying, and likewise there is compiler support.  For example,
it means L'\0' etc. works.

I'm really not interested in fudging round that sort of thing at this
stage, nor random #ifdef fudges.  Anything I do is likely to be on
Linux.  If someone else wants to see what the effect of relaxing the test
is and how it can be fixed up, if necessary, I will be delighted.  For
me it's an unnecessary complication stopping me getting something going.

> > Before I get to details of what I've patched so far, one question: how
> > do we turn input into characters?  My first thought was to do it at a low
> > level around getkey, possibly in getkeybuf which already does
> 
> That would seem more sensible to me. Allowing partial multi-byte
> sequences to be bound is not very nice and probably not very useful.

I'm not sure I agree.  Suppose you have a Meta-sequence bound, i.e. a
binding for some ASCII character with the high bit set.  This conflicts
with a fully working UTF-8 input system, but you may well not have that,
just a system which happens to boot up with a UTF-8 locale.
Implementing only wchar_t based lookups will break this completely.
That seems pretty fatal to me.
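
The clash can be shown with a small sketch. It assumes a terminal that
sends Meta-<char> as the character with the high bit set; both function
names are invented for the example, and the length table follows the
standard UTF-8 lead-byte ranges.

```c
#include <assert.h>

/* Byte sent for Meta-<c> by a terminal that sets the high bit. */
unsigned char meta_byte(unsigned char c)
{
    assert(c < 0x80);
    return (unsigned char)(c | 0x80);
}

/*
 * Length of the UTF-8 sequence introduced by lead byte b:
 * 1 for ASCII, 2 to 4 for multibyte lead bytes, and 0 if b cannot
 * start a sequence (continuation byte or invalid lead byte).
 */
int utf8_sequence_length(unsigned char b)
{
    if (b < 0x80)
        return 1;
    if (b >= 0xC2 && b <= 0xDF)
        return 2;
    if (b >= 0xE0 && b <= 0xEF)
        return 3;
    if (b >= 0xF0 && b <= 0xF4)
        return 4;
    return 0;
}
```

Meta-a arrives as 0xE1, which UTF-8 treats as the start of a three-byte
sequence, so a reader that only hands back complete wide characters
would wait for two more bytes that never come. Other Meta bytes, such
as Meta-1 (0xB1), are not valid lead bytes at all.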

> > The actual Unicode-related changes are minimal.  system.h shows how I
> 
> Did you mean to attach an actual patch?

No, it's huge with all the changes of names.

> > #if defined(HAVE_WCHAR_H) && defined(HAVE_WCTOMB) && defined (__STDC_ISO_10646__)
> > # include <wchar.h>
> 
> You'll probably want to include wchar.h even if __STDC_ISO_10646__ is
> not defined. For \u/\U, wchar_t was only useful when converting from
> unicode to wchar_t could be done trivially: when __STDC_ISO_10646__ is
> defined. It otherwise uses iconv or a hardcoded UTF-8 conversion. For
> zle, I can't think of any instance where you would care whether wchar_t
> is unicode.

You may well be right, but again it's just another unnecessary
complication at this stage and I don't have a system to test the
effect.

> > /*
> >  * More stringent requirements to enable complete Unicode conversion
> >  * between wide characters and multibyte strings.
> >  */
> > #if defined(HAVE_MBTOWC)
> > /*#define ZLE_UNICODE_SUPPORT	1*/
> 
> I don't quite follow the logic of that check.
> 
> I wouldn't have thought ZLE_UNICODE_SUPPORT is a good name for the
> define. The requirement is to support multibyte character encodings, not
> specifically "unicode" and the same define will probably be extended to
> areas outside of zle. How about ENABLE_MULTIBYTE, perhaps linked to a
> configure --disable-multibyte option.

The *requirement* is to support Unicode, in particular UTF-8.  The
*hope* is to be able to allow other schemes without much work.  The
whole point of Unicode is to replace other schemes anyway, and they are
unlikely ever to be well-tested even if they work.

I'm strongly of the opinion we should stick with multibyte strings
outside Zle, and continue to have the interface to ZLE pass back such a
string.  I think the difficulties and costs of extending the use of
wchar_t outside ZLE would be prohibitive for very little gain.  So
this definition does apply only to ZLE, and (although this is not
necessarily all it does) is specifically targeted at making Unicode
schemes work.

> > typedef wchar_t *ZLE_STRING_T;
> > #else
> > typedef int ZLE_CHAR_T;
> 
> Why int and not unsigned char? Is it really worth having the separate
> STRING type? Again, I wouldn't use "ZLE" in the name given that we may
> want to use it outside zle someday.

This is uncontroversial; it's what we do at the moment.  Functions
returning a single character return an int and functions dealing with a
string use an unsigned char *.  The separate definitions wouldn't be
necessary if we just had wchar_t arrays, it's the backward compatibility
that makes it necessary.
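
A sketch of why two typedefs are needed, using deliberately different
names so as not to suggest this is the actual system.h text.
SKETCH_WIDE_EDITOR stands in for the thread's ZLE_UNICODE_SUPPORT, and
the wide-character branch's character type is an assumption; only the
int / unsigned char * pairing of the fallback branch is stated above.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

#ifdef SKETCH_WIDE_EDITOR
/* Wide build: one type serves for both (character type assumed). */
typedef wchar_t        SKETCH_CHAR_T;
typedef wchar_t       *SKETCH_STRING_T;
#else
/* Byte build: single characters travel as int, strings as unsigned char *. */
typedef int            SKETCH_CHAR_T;
typedef unsigned char *SKETCH_STRING_T;
#endif

/*
 * Fetch one element of the line as the single-character type.
 * The same source compiles in either configuration.
 */
SKETCH_CHAR_T sketch_char_at(SKETCH_STRING_T line, size_t idx)
{
    assert(line != NULL);
    return (SKETCH_CHAR_T)line[idx];
}
```

In the byte build the element type (unsigned char) and the
single-character type (int) differ, so a string type cannot be written
simply as "pointer to character type"; with wchar_t arrays alone it
could.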

As I said, I have no intention of ever using this scheme outside ZLE.

> > All the tests still pass, so I will commit this some time today.
> 
> Would it be worth creating a separate branch for multibyte support? It
> could later become 4.3. If so I'd suggest we continue to commit
> everything non-multibyte related to the current branch to avoid the old
> issue of the current release being very old.

(We'd probably do it the other way --- create a separate stable 4.2
branch so the new one was a mainline.)  It may become necessary, but
this will be just another way of slowing things down, so I'd like to
avoid it until things become too hairy.

It would be good at least to get 4.2.2 out of the way before anything
more than groundwork appears, though.

-- 
Peter Stephenson <pws@csr.com>                  Software Engineer
CSR PLC, Churchill House, Cambridge Business Park, Cowley Road
Cambridge, CB4 0WZ, UK                          Tel: +44 (0)1223 692070

* Re: Some groundwork for Unicode in Zle
  2005-01-11 13:59 ` Peter Stephenson
@ 2005-01-11 15:11   ` DervishD
  2005-01-14 13:10   ` Peter Stephenson
  1 sibling, 0 replies; 19+ messages in thread
From: DervishD @ 2005-01-11 15:11 UTC (permalink / raw)
  To: Peter Stephenson; +Cc: Zsh hackers list

    Hi Peter :)

 * Peter Stephenson <pws@csr.com> dixit:
> Peter Stephenson wrote:
> > I'm still not sure how to test whether a multibyte
> > string is invalid rather than incomplete.
> After a little more research, it's obvious we should be using
> mbrtowc, which does distinguish.  So I'll test for that and
> wcrtomb.

    Sorry, I hadn't received this message when I wrote mine. Sorry
for the noise O:)

    Raúl Núñez de Arenas Coronado

-- 
Linux Registered User 88736
http://www.dervishd.net & http://www.pleyades.net/
It's my PC and I'll cry if I want to...


* Re: Some groundwork for Unicode in Zle
  2005-01-11 13:52 Peter Stephenson
  2005-01-11 13:59 ` Peter Stephenson
  2005-01-11 14:51 ` Oliver Kiddle
@ 2005-01-11 15:09 ` DervishD
  2005-01-11 16:27 ` Bart Schaefer
  3 siblings, 0 replies; 19+ messages in thread
From: DervishD @ 2005-01-11 15:09 UTC (permalink / raw)
  To: Peter Stephenson; +Cc: Zsh hackers list

    Hi Peter :)

 * Peter Stephenson <pws@csr.com> dixit:
> I'm still not sure how to test whether a multibyte
> string is invalid rather than incomplete.

    AFAIK, mbrtowc returns (size_t)-1 for invalid and (size_t)-2 for
incomplete (and 0 for the null wide character). Although the return
type is 'size_t', it is nearly impossible for the function to return
(size_t) -1 (or -2) for a *correct* conversion in UTF-8. I think it is
a safe bet to assume that even an ill-formed operating system not
honoring C99 will have SIZE_MAX greater than 8 XDDDDD

    I haven't used the restartable set of multibyte functions myself,
so please take a look at SUS, POSIX or whatever to see the gory
details for this function. I've used SUS for reference.
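
    A minimal sketch of that distinction, for illustration only. The
enum, the function names and the locale names tried are assumptions of
this example; mbrtowc and its (size_t)-1 / (size_t)-2 results are
standard. The result depends on the current LC_CTYPE locale.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

enum seq_status { SEQ_COMPLETE, SEQ_INCOMPLETE, SEQ_INVALID };

/*
 * Try to switch LC_CTYPE to a UTF-8 locale. The two names tried are
 * common but not guaranteed to exist. Returns 1 on success.
 */
int select_utf8_locale(void)
{
    return setlocale(LC_CTYPE, "C.UTF-8") != NULL
        || setlocale(LC_CTYPE, "en_US.UTF-8") != NULL;
}

/*
 * Classify the first n (> 0) bytes as a complete character, the start
 * of a possibly valid one, or an invalid sequence. On SEQ_COMPLETE the
 * character is stored in *out if out is not NULL.
 */
enum seq_status classify_multibyte(const char *bytes, size_t n, wchar_t *out)
{
    mbstate_t st;
    size_t r;

    assert(bytes != NULL && n > 0);
    memset(&st, 0, sizeof st);       /* start from the initial state */
    r = mbrtowc(out, bytes, n, &st);
    if (r == (size_t)-1)
        return SEQ_INVALID;          /* errno is EILSEQ */
    if (r == (size_t)-2)
        return SEQ_INCOMPLETE;       /* need more bytes */
    return SEQ_COMPLETE;             /* r is 0 or the byte count used */
}
```

    In a UTF-8 locale the single byte 0xC3 is reported as incomplete,
0xC3 0xA9 as complete, and 0xFF as invalid.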
 
    Raúl Núñez de Arenas Coronado

-- 
Linux Registered User 88736
http://www.dervishd.net & http://www.pleyades.net/
It's my PC and I'll cry if I want to...


* Re: Some groundwork for Unicode in Zle
  2005-01-11 14:54   ` Mads Martin Joergensen
@ 2005-01-11 14:56     ` Mads Martin Joergensen
  0 siblings, 0 replies; 19+ messages in thread
From: Mads Martin Joergensen @ 2005-01-11 14:56 UTC (permalink / raw)
  To: Zsh hackers list

* Mads Martin Joergensen <mmj@suse.de> [Jan 11. 2005 15:54]:
> > > from how the line is encoded internally.  We can use wchar_t inside and
> > > pass back a multibyte string.
> > 
> > Good to see this being addressed. How do you plan to cope with encoding
> > nulls if you use wchar_t? (or does zle not bother?) The whole meta stuff
> > is what really scared me off ever touching this.
> 
> It's not just 'good', it's really awesome. Thanks for doing this!

Forgot to mention. If you'll release development versions with this,
I'll build packages that people can and will test.

-- 
Mads Martin Joergensen, http://mmj.dk
"Why make things difficult, when it is possible to make them cryptic
 and totally illogical, with just a little bit more effort?"
                                -- A. P. J.


* Re: Some groundwork for Unicode in Zle
  2005-01-11 14:51 ` Oliver Kiddle
@ 2005-01-11 14:54   ` Mads Martin Joergensen
  2005-01-11 14:56     ` Mads Martin Joergensen
  2005-01-11 15:30   ` Peter Stephenson
  1 sibling, 1 reply; 19+ messages in thread
From: Mads Martin Joergensen @ 2005-01-11 14:54 UTC (permalink / raw)
  To: Zsh hackers list

* Oliver Kiddle <okiddle@yahoo.co.uk> [Jan 11. 2005 15:52]:
> Peter wrote:
> > from how the line is encoded internally.  We can use wchar_t inside and
> > pass back a multibyte string.
> 
> Good to see this being addressed. How do you plan to cope with encoding
> nulls if you use wchar_t? (or does zle not bother?) The whole meta stuff
> is what really scared me off ever touching this.

It's not just 'good', it's really awesome. Thanks for doing this!

-- 
Mads Martin Joergensen, http://mmj.dk
"Why make things difficult, when it is possible to make them cryptic
 and totally illogical, with just a little bit more effort?"
                                -- A. P. J.


* Re: Some groundwork for Unicode in Zle
  2005-01-11 13:52 Peter Stephenson
  2005-01-11 13:59 ` Peter Stephenson
@ 2005-01-11 14:51 ` Oliver Kiddle
  2005-01-11 14:54   ` Mads Martin Joergensen
  2005-01-11 15:30   ` Peter Stephenson
  2005-01-11 15:09 ` DervishD
  2005-01-11 16:27 ` Bart Schaefer
  3 siblings, 2 replies; 19+ messages in thread
From: Oliver Kiddle @ 2005-01-11 14:51 UTC (permalink / raw)
  To: Zsh hackers list

Peter wrote:
> from how the line is encoded internally.  We can use wchar_t inside and
> pass back a multibyte string.

Good to see this being addressed. How do you plan to cope with encoding
nulls if you use wchar_t? (or does zle not bother?) The whole meta stuff
is what really scared me off ever touching this.

> I've made a very dull patch that does a few things that might make adding
> Unicode support to Zle easier.  Actually, I think within Zle it should
> be easy to use generic wchar_t's and not worry about whether they're
> really Unicode, but I still propose to rely on __STDC_ISO_10646__ to

Why? Relying on __STDC_ISO_10646__ will rule out a good number of
systems that do otherwise have good support for multibyte encodings such
as UTF-8. __STDC_ISO_10646__ is defined on surprisingly few systems. We
really don't care about what wchar_t is internally if we let libc do our
conversions.

> Before I get to details of what I've patched so far, one question: how
> do we turn input into characters?  My first thought was to do it at a low
> level around getkey, possibly in getkeybuf which already does

That would seem more sensible to me. Allowing partial multi-byte
sequences to be bound is not very nice and probably not very useful.

> The actual Unicode-related changes are minimal.  system.h shows how I

Did you mean to attach an actual patch?

> #if defined(HAVE_WCHAR_H) && defined(HAVE_WCTOMB) && defined (__STDC_ISO_10646__)
> # include <wchar.h>

You'll probably want to include wchar.h even if __STDC_ISO_10646__ is
not defined. For \u/\U, wchar_t was only useful when converting from
unicode to wchar_t could be done trivially: when __STDC_ISO_10646__ is
defined. It otherwise uses iconv or a hardcoded UTF-8 conversion. For
zle, I can't think of any instance where you would care whether wchar_t
is unicode.

> /*
>  * More stringent requirements to enable complete Unicode conversion
>  * between wide characters and multibyte strings.
>  */
> #if defined(HAVE_MBTOWC)
> /*#define ZLE_UNICODE_SUPPORT	1*/

I don't quite follow the logic of that check.

I wouldn't have thought ZLE_UNICODE_SUPPORT is a good name for the
define. The requirement is to support multibyte character encodings, not
specifically "unicode" and the same define will probably be extended to
areas outside of zle. How about ENABLE_MULTIBYTE, perhaps linked to a
configure --disable-multibyte option.

> typedef wchar_t *ZLE_STRING_T;
> #else
> typedef int ZLE_CHAR_T;

Why int and not unsigned char? Is it really worth having the separate
STRING type? Again, I wouldn't use "ZLE" in the name given that we may
want to use it outside zle someday.

> All the tests still pass, so I will commit this some time today.

Would it be worth creating a separate branch for multibyte support? It
could later become 4.3. If so I'd suggest we continue to commit
everything non-multibyte related to the current branch to avoid the old
issue of the current release being very old.

Oliver

* Re: Some groundwork for Unicode in Zle
  2005-01-11 13:52 Peter Stephenson
@ 2005-01-11 13:59 ` Peter Stephenson
  2005-01-11 15:11   ` DervishD
  2005-01-14 13:10   ` Peter Stephenson
  2005-01-11 14:51 ` Oliver Kiddle
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 19+ messages in thread
From: Peter Stephenson @ 2005-01-11 13:59 UTC (permalink / raw)
  To: Zsh hackers list

Peter Stephenson wrote:
> I'm still not sure how to test whether a multibyte
> string is invalid rather than incomplete.

After a little more research, it's obvious we should be using
mbrtowc, which does distinguish.  So I'll test for that and
wcrtomb.

-- 
Peter Stephenson <pws@csr.com>                  Software Engineer
CSR PLC, Churchill House, Cambridge Business Park, Cowley Road
Cambridge, CB4 0WZ, UK                          Tel: +44 (0)1223 692070

* Some groundwork for Unicode in Zle
@ 2005-01-11 13:52 Peter Stephenson
  2005-01-11 13:59 ` Peter Stephenson
                   ` (3 more replies)
  0 siblings, 4 replies; 19+ messages in thread
From: Peter Stephenson @ 2005-01-11 13:52 UTC (permalink / raw)
  To: Zsh hackers list

It seems clear that the line editor is the place most people are missing
Unicode support, so I suggest we start from there and work back.  It's
relatively self-contained in that we can protect the rest of the shell
from how the line is encoded internally.  We can use wchar_t inside and
pass back a multibyte string.

I've made a very dull patch that does a few things that might make adding
Unicode support to Zle easier.  Actually, I think within Zle it should
be easy to use generic wchar_t's and not worry about whether they're
really Unicode, but I still propose to rely on __STDC_ISO_10646__ to
assure us that we have a suitable environment, since otherwise it opens a
huge can of worms.  (If it seems easy to fix up afterwards, fine, but
it could be a lot of work for ever-decreasing gain.)

Before I get to details of what I've patched so far, one question: how
do we turn input into characters?  My first thought was to do it at a low
level around getkey, possibly in getkeybuf which already does
metafication.  This would loop until it picked up enough bytes for a
wide character, and would return that.  Then essentially all higher
level uses of characters in ZLE would be based around wchar_t, including
looking up keys.  (This would mean much greater use of sparse keymaps,
though we could keep a dense keymap for the first 128 characters and not
lose much efficiency.)  The advantage is this is transparent to input
systems that handle multibyte strings properly.  The disadvantage is
that for other 8-bit characters you can get stuck.
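
The mixed dense/sparse keymap mentioned in parentheses could be sketched
as below.  This is an illustration only: struct keymap, keymap_bind,
keymap_lookup and the placeholder widgets are hypothetical names, and
the sparse part is a plain linear list for brevity where a real
implementation would more likely use a hash table.

```c
#include <stdlib.h>
#include <wchar.h>

#define DENSE_SIZE 128              /* ASCII range kept in a flat table */

typedef void (*widget_fn)(void);    /* hypothetical stand-in for a widget */

struct sparse_entry {
    wchar_t key;
    widget_fn fn;
};

struct keymap {
    widget_fn dense[DENSE_SIZE];    /* direct index for keys 0..127 */
    struct sparse_entry *sparse;    /* unsorted list for all other keys */
    size_t nsparse, cap;
};

/* Two placeholder widgets so the sketch can be exercised. */
static void widget_a(void) {}
static void widget_b(void) {}

/* Bind key to fn; returns 0 on success, -1 if out of memory. */
static int keymap_bind(struct keymap *km, wchar_t key, widget_fn fn)
{
    size_t i;

    /* The cast keeps the test correct whether wchar_t is signed or not. */
    if ((unsigned long)key < DENSE_SIZE) {
        km->dense[key] = fn;
        return 0;
    }
    for (i = 0; i < km->nsparse; i++)
        if (km->sparse[i].key == key) {
            km->sparse[i].fn = fn;  /* rebind an existing key */
            return 0;
        }
    if (km->nsparse == km->cap) {
        size_t ncap = km->cap ? km->cap * 2 : 8;
        struct sparse_entry *p = realloc(km->sparse, ncap * sizeof(*p));

        if (!p)
            return -1;
        km->sparse = p;
        km->cap = ncap;
    }
    km->sparse[km->nsparse].key = key;
    km->sparse[km->nsparse].fn = fn;
    km->nsparse++;
    return 0;
}

/* Look up key; returns NULL if it is unbound. */
static widget_fn keymap_lookup(const struct keymap *km, wchar_t key)
{
    size_t i;

    if ((unsigned long)key < DENSE_SIZE)
        return km->dense[key];
    for (i = 0; i < km->nsparse; i++)
        if (km->sparse[i].key == key)
            return km->sparse[i].fn;
    return NULL;
}
```

A zero-initialised struct keymap is an empty keymap; the caller frees
km.sparse when done.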

To get around that it would be possible to keep the input as multibyte
strings until the keymap lookup.  That's much more conservative and
makes it easier to handle older input systems, bind single-byte
characters with the high bit set (if you still want to), etc.  Then
we possibly need some smart way of doing self-insert.  For example, it
could be made to test pending input for a complete multibyte character
and convert it.  I'm still not sure how to test whether a multibyte
string is invalid rather than incomplete.

The present change:

The vast majority of the patch is simply to get rid of the lies in the
header about the names of variables.  cs and ll are now zlecs and zlell
throughout instead of being #define'd (they used to be zshcs and zshll
in the definition but I thought the new names were more consistent).
Also, the zle pointers are called by name instead of by a macro
defined to be the name of the zle function.  This isn't directly related
to Unicode but has been annoying me for ages.

The actual Unicode-related changes are minimal.  system.h shows how I
suggest deciding whether to compile in support into zle.  At the minimum
we need wctomb and mbtowc as well as C support.  (For future
sophistication we will want wcwidth etc. but we can build a working
system for many character sets without it.)  ZLE_UNICODE_SUPPORT will be
defined to 1 when the conditions are met.  The header chunk looks like
this.  Part of it is moved from utils.c.

/*
 * This is a subset of ZLE_UNICODE_SUPPORT.  It is not all that likely
 * that only the subset is supported, however it's easy to make the
 * \u and \U escape sequences work with just the following.
 */
#if defined(HAVE_WCHAR_H) && defined(HAVE_WCTOMB) && defined(__STDC_ISO_10646__)
# include <wchar.h>

/*
 * More stringent requirements to enable complete Unicode conversion
 * between wide characters and multibyte strings.
 */
# if defined(HAVE_MBTOWC)
/*#define ZLE_UNICODE_SUPPORT	1*/
# endif
#else
# ifdef HAVE_LANGINFO_H
#  include <langinfo.h>
#  if defined(HAVE_ICONV) || defined(HAVE_LIBICONV)
#   include <iconv.h>
#  endif
# endif
#endif

#ifdef ZLE_UNICODE_SUPPORT
typedef wchar_t ZLE_CHAR_T;
typedef wchar_t *ZLE_STRING_T;
#else
typedef int ZLE_CHAR_T;
typedef unsigned char *ZLE_STRING_T;
#endif

The other change is that I have made the variable "line" local to Zle.
This required adding the function zlegetline to return the line so far.
This is currently trivial, but will eventually look like part of the return
sequence from zlegetline.  As we need all the help we can get, I have
renamed "line" to "zleline" --- a local variable "line" is used internally
in many places in the completion code, and the word occurs in all sorts
of comments, so it was hard to locate uses of the variable.
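
Once the line is held as wide characters, handing it back to the rest of
the shell means converting it to a multibyte string.  A minimal sketch
of that step, assuming the standard wcrtomb interface; the function name
wide_line_to_mb is hypothetical and is not the zlegetline from the
patch.  Non-ASCII characters only convert if setlocale has selected a
suitable LC_CTYPE beforehand.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/*
 * Convert the first len wide characters of wline to a newly allocated,
 * NUL-terminated multibyte string in the current locale's encoding.
 * Returns NULL on allocation failure or if a character cannot be
 * encoded.  The caller frees the result.
 */
static char *wide_line_to_mb(const wchar_t *wline, size_t len)
{
    mbstate_t st;
    char *out, *p;
    size_t i;

    /* Each wide character needs at most MB_CUR_MAX bytes. */
    out = malloc(len * MB_CUR_MAX + 1);
    if (!out)
        return NULL;

    memset(&st, 0, sizeof(st));     /* initial conversion state */
    p = out;
    for (i = 0; i < len; i++) {
        size_t n = wcrtomb(p, wline[i], &st);

        if (n == (size_t)-1) {      /* not representable in this locale */
            free(out);
            return NULL;
        }
        p += n;
    }
    *p = '\0';
    return out;
}
```

The sketch ignores stateful encodings, which would need a final shift
sequence before the terminating NUL.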

Apart from the fact that no one understands how the command line is used
inside the completion code, there is also the problem that zlecs and
zlell (cursor position and line length) are exposed in lex.c and hist.c
for use when analysing a line for completion.  This will be a problem
when ZLE is measuring in characters and lex.c in bytes.  I tried to
separate out the variables into lexcs and lexll, but didn't get it to
work.
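
To illustrate the mismatch between a cursor counted in characters and
one counted in bytes, here is a minimal sketch of the conversion in both
directions.  It assumes the multibyte encoding is UTF-8, which nothing
in this message requires, and does no validation; the function names are
hypothetical.

```c
#include <stddef.h>

/*
 * Return the byte offset at which character number char_pos starts in
 * the UTF-8 string s of byte_len bytes.  Continuation bytes have the
 * form 10xxxxxx, so every other byte starts a new character.
 * If char_pos is at or past the end, byte_len is returned.
 * Assumes s is well-formed UTF-8.
 */
static size_t utf8_char_to_byte(const char *s, size_t byte_len,
                                size_t char_pos)
{
    size_t i, chars = 0;

    for (i = 0; i < byte_len; i++) {
        if (((unsigned char)s[i] & 0xC0) != 0x80) {
            if (chars == char_pos)
                return i;
            chars++;
        }
    }
    return byte_len;
}

/* Inverse direction: number of characters that start before byte_pos. */
static size_t utf8_byte_to_char(const char *s, size_t byte_pos)
{
    size_t i, chars = 0;

    for (i = 0; i < byte_pos; i++)
        if (((unsigned char)s[i] & 0xC0) != 0x80)
            chars++;
    return chars;
}
```

With a two-byte character at the start of the line, a cursor after the
first character is at character index 1 but byte offset 2, which is the
kind of discrepancy lex.c and hist.c would see.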

All the tests still pass, so I will commit this some time today.

-- 
Peter Stephenson <pws@csr.com>                  Software Engineer
CSR PLC, Churchill House, Cambridge Business Park, Cowley Road
Cambridge, CB4 0WZ, UK                          Tel: +44 (0)1223 692070

Thread overview: 19+ messages
2005-01-14 15:54 Some groundwork for Unicode in Zle François-Xavier Coudert
2005-01-14 16:06 ` Clint Adams
2005-01-14 16:20   ` François-Xavier Coudert
2005-01-14 16:49 ` Peter Stephenson
  -- strict thread matches above, loose matches on Subject: below --
2005-01-11 13:52 Peter Stephenson
2005-01-11 13:59 ` Peter Stephenson
2005-01-11 15:11   ` DervishD
2005-01-14 13:10   ` Peter Stephenson
2005-01-15 17:35     ` Clint Adams
2005-01-15 19:28       ` Peter Stephenson
2005-01-11 14:51 ` Oliver Kiddle
2005-01-11 14:54   ` Mads Martin Joergensen
2005-01-11 14:56     ` Mads Martin Joergensen
2005-01-11 15:30   ` Peter Stephenson
2005-01-11 16:31     ` Bart Schaefer
2005-01-11 15:09 ` DervishD
2005-01-11 16:27 ` Bart Schaefer
2005-01-11 16:35   ` Vin Shelton
2005-01-11 16:49   ` DervishD
