zsh-workers
* RE: UTF-8 fonts
@ 2002-09-25 11:11 Borzenkov Andrey
  2002-09-25 11:36 ` Peter Stephenson
  0 siblings, 1 reply; 10+ messages in thread
From: Borzenkov Andrey @ 2002-09-25 11:11 UTC (permalink / raw)
  To: 'Zsh hackers list'

Just to make it clear. Is the aim to use UTF-8 internally or to support
(arbitrary) multibyte encoding?

> See http://www.cl.cam.ac.uk/~mgk25/unicode.html for a nice summary of
> the subject.
> 
> My first thought about using UTF-8 instead of eight bit characters

this sounds like you want to convert input to UTF-8 internally?

> was
> that we would have to replace the current `Meta' system.  However, I
> don't think we do since the current system will seamlessly translate
> from UTF-8 input to UTF-8 output.
> 
> Therefore, all we have to do is modify the shell's internals at the
> point where it actually compares characters --- or, more generally,
> tries to turn metafied sequences into a single character --- to use the
> normal UTF8 rules.  There may also be some extra places where counting
> the length needs changing.
> 

You also need to modify any place where shell compares or translates (upper
<-> lower) characters. This is by definition locale dependent - collating
order is different in different languages even when they use the same
character set. Which means you can use UTF-8 (or, more generally, any
multibyte encoding) only if your current locale supports it. Which in effect
means using wc* and mb* function suite anyway.

But this also means you cannot assume anything about current character set
and cannot assume that it is transparent w.r.t. current string handling in
zsh.

> Unicode characters are up to 6 bytes, so either with 64-bit integers we
> can do a direct comparison with some bit arithmetic, or we can just use
> strncmp.  (I don't fancy relying on internationalisation support for
> this but in principle that's probably the right thing to do.)
> Hence I don't see the necessity for actually decoding UTF-8 into Unicode
> at any point, just deciding the number of bytes.  Not doing this avoids
> problems with overlong encodings (ones which illegally represent a
> character using too many bytes): an overlong encoding will always
> compare differently to the standard encoding.
> 

How do you know your input (and the strings you are processing) are UTF-8?
Besides, the standards do not provide a way to input a multibyte character -
you can only read wide characters.

> Probably we need a configuration option to switch this on or off.
> 

Yes, either we rely on standard locale support (and do not care what
character set is being used) or we must provide some OOB means to define
character set in use. 

> Zle might be a bit more of a problem.  The web page I referred to above
> gives the hopeful message that all encoding to/decoding from UTF-8 at
> the terminal is handled by the terminal driver.  So for zle we have to
> worry about things like
> - determining whether the terminal is actually in UTF-8 mode, probably
>   from the locale

Impossible. Locale names are just arbitrarily chosen strings; there is no
"character set code" defined in any locale definition, at least on Unix.

> - how UTF-8 encoded characters interfere with meta-bindings.  May be
>   good enough simply not to use these, at least while we work out what's
>   what
> - reading multi-byte characters --- timeouts and the like

use standard OS interfaces to read wide characters.

> - getting the right length for displaying, deleting, copying
>   etc. multi-byte characters.  Apart from counting continuation
>   bytes, we may be stuck with using wcwidth for display.  This is a pain
>   because it involves explicit wchar_t's, and I have no experience at
>   all with these (except that they mess up compilation of otherwise
>   trivial string-handling functions).
> - all the stuff I've forgotten.
> 
> Any comments?
> 

-andrey



* Re: UTF-8 fonts
  2002-09-25 11:11 UTF-8 fonts Borzenkov Andrey
@ 2002-09-25 11:36 ` Peter Stephenson
  2002-09-25 13:27   ` Nadav Har'El
  2002-09-25 17:29   ` Oliver Kiddle
  0 siblings, 2 replies; 10+ messages in thread
From: Peter Stephenson @ 2002-09-25 11:36 UTC (permalink / raw)
  To: Zsh hackers list

Borzenkov Andrey wrote:
> Just to make it clear. Is the aim to use UTF-8 internally or to support
> (arbitrary) multibyte encoding?

The first with as much of the second as we can get in without too much
work.  My current plan is to rely on mbtowc/mblen to identify and match
multibyte characters in strings where we need to.  These are already
aware of the locale, so our only assumption is that characters without
the top bit set are ASCII; this is the major limitation on any non-UTF-8
multibyte encodings --- it's this feature of UTF-8 that makes it so
suitable for UNIX use.  With this assumption wide characters are
transparent to the vast majority of the shell and we only need to look
at the characters for comparisons between characters, lengths of strings
and testing the size for output.
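
A minimal sketch of that plan, not code from zsh: bytes without the top
bit set are counted directly as ASCII, and only the others are handed to
mblen(), which follows the current LC_CTYPE locale. The name count_chars
is hypothetical, and counting an invalid byte as one character is an
assumption of the sketch.

```c
#include <locale.h>   /* setlocale(), for the caller to select the locale */
#include <stdlib.h>   /* mblen */
#include <string.h>   /* strlen */

/* Count characters (not bytes) in a NUL-terminated string.
 * Hypothetical helper for illustration only. */
size_t count_chars(const char *s)
{
    size_t remaining = strlen(s);
    size_t count = 0;

    (void)mblen(NULL, 0);               /* reset mblen's internal shift state */
    while (remaining > 0) {
        int len = 1;                    /* fast path: top bit clear means ASCII */
        if ((unsigned char)*s & 0x80) {
            len = mblen(s, remaining);  /* byte length according to the locale */
            if (len < 1) {
                (void)mblen(NULL, 0);   /* state is unreliable after an error */
                len = 1;                /* assumption: invalid byte counts as one character */
            }
        }
        s += len;
        remaining -= (size_t)len;
        count++;
    }
    return count;
}
```

The caller is expected to have run setlocale(LC_CTYPE, "") beforehand;
in a UTF-8 locale the two-byte sequence for e-acute then counts as one
character.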

> > See http://www.cl.cam.ac.uk/~mgk25/unicode.html for a nice summary of
> > the subject.
> > 
> > My first thought about using UTF-8 instead of eight bit characters
> 
> this sounds like you want to convert input to UTF-8 internally?

If you read the link, you will see that the plan as far as stream
applications are concerned is that the input is already UTF-8 and the
output will be treated as UTF-8.  So we don't do any conversion except
for the cases I mentioned.

> You also need to modify any place where shell compares or translates (upper
> <-> lower) characters.

We decided some time ago not to use strcoll, because it broke in some
nasty ways.  So it's now documented that we just use character positions
in the character set for comparisons.  This has generated far fewer
complaints (as far as I'm aware, none) than the previous version.  It
seems inevitable to extend this to multibyte characters.
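
As an illustration only (this is not the shell's actual comparison
code), ordering by position in the character set can be as simple as a
byte comparison. For valid UTF-8, byte order and code point order
agree, so the same rule carries over to multibyte characters. The name
cmp_by_code_position is hypothetical.

```c
#include <string.h>   /* strcmp */

/* Order two strings by raw byte value, ignoring the locale's collation
 * rules (unlike strcoll).  Returns -1, 0 or 1. */
int cmp_by_code_position(const char *a, const char *b)
{
    int r = strcmp(a, b);       /* strcmp compares bytes as unsigned char */
    return (r > 0) - (r < 0);   /* normalise the sign */
}
```

With this rule "Z" sorts before "a" in every locale, which is exactly
the behaviour strcoll may change.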

> But this also means you cannot assume anything about current character set
> and cannot assume that it is transparent w.r.t. current string handling in
> zsh.

We are going to assume that bytes without the top-bit set are ASCII, and
the remainder require mb* handling.

> How do you know your input (and the strings you are processing) are UTF-8?
> Besides, the standards do not provide a way to input a multibyte character -
> you can only read wide characters.

No, as I said above the whole point of UTF-8 is that you can for the
most part just use normal strings.  I am not planning on supporting any
system that doesn't have this feature.

> > - determining whether the terminal is actually in UTF-8 mode, probably
> >   from the locale
> 
> Impossible. Locale names are just arbitrarily chosen strings; there is no
> "character set code" defined in any locale definition, at least on Unix.

Read the document at the link I gave which suggests otherwise.  However,
I now think we can in any case leave this to the mb* suite to decide.

> > - reading multi-byte characters --- timeouts and the like
> 
> use standard OS interfaces to read wide characters.

No, because we are reading them as individual bytes.  If you don't have
the complete multibyte character you are pretty stuck unless you
interpret a partial character yourself, hence the problems with active
meta-characters in zle.
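
A rough sketch of what interpreting a partial character could look
like, assuming the restartable mbrtowc() is available. It is not taken
from zle, and feed_byte is a hypothetical name. The mbstate_t object
carries the partial character between calls, so bytes can be fed in one
at a time as they arrive.

```c
#include <locale.h>   /* setlocale(), for the caller to select the locale */
#include <string.h>   /* memset */
#include <wchar.h>    /* mbrtowc, mbstate_t */

/* Feed one input byte to the decoder.
 * Returns 1 when *out holds a complete character, 0 when more bytes
 * are needed, and -1 when the byte cannot continue a valid sequence. */
int feed_byte(mbstate_t *state, unsigned char byte, wchar_t *out)
{
    char c = (char)byte;
    size_t r = mbrtowc(out, &c, 1, state);

    if (r == (size_t)-2)
        return 0;       /* incomplete: state remembers the partial character */
    if (r == (size_t)-1) {
        memset(state, 0, sizeof *state);   /* state is undefined after an error */
        return -1;
    }
    return 1;           /* r is 0 (NUL) or 1: a character was completed */
}
```

The state must start zeroed (memset to 0). A timeout while the function
keeps returning 0 is the case where the caller has to decide what to do
with the bytes collected so far.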

-- 
Peter Stephenson <pws@csr.com>                  Software Engineer
CSR Ltd., Science Park, Milton Road,
Cambridge, CB4 0WH, UK                          Tel: +44 (0)1223 692070


**********************************************************************
The information transmitted is intended only for the person or
entity to which it is addressed and may contain confidential 
and/or privileged material. 
Any review, retransmission, dissemination or other use of, or
taking of any action in reliance upon, this information by 
persons or entities other than the intended recipient is 
prohibited.  
If you received this in error, please contact the sender and 
delete the material from any computer.
**********************************************************************



* Re: UTF-8 fonts
  2002-09-25 11:36 ` Peter Stephenson
@ 2002-09-25 13:27   ` Nadav Har'El
  2002-09-25 17:29   ` Oliver Kiddle
  1 sibling, 0 replies; 10+ messages in thread
From: Nadav Har'El @ 2002-09-25 13:27 UTC (permalink / raw)
  To: Peter Stephenson; +Cc: Zsh hackers list

On Wed, Sep 25, 2002, Peter Stephenson wrote about "Re: UTF-8 fonts":
> > > - determining whether the terminal is actually in UTF-8 mode, probably
> > >   from the locale
> > 
> > Impossible. Locale names are just arbitrarily chosen strings; there is no
> > "character set code" defined in any locale definition, at least on Unix.
> 
> Read the document at the link I gave which suggests otherwise.  However,
> I now think we can in any case leave this to the mb* suite to decide.

Here's a piece of code I used in one of my programs to tell whether the
user's terminal is in utf8 mode, based on the locale:

        /* needs <locale.h>, <langinfo.h> and <string.h> */
        setlocale(LC_CTYPE, "");
        is_utf8 = !strcmp(nl_langinfo(CODESET), "UTF-8");


-- 
Nadav Har'El                        |   Wednesday, Sep 25 2002, 19 Tishri 5763
nyh@math.technion.ac.il             |-----------------------------------------
Phone: +972-53-245868, ICQ 13349191 |For people who like peace and quiet - a
http://nadav.harel.org.il           |phoneless cord.



* Re: UTF-8 fonts
  2002-09-25 11:36 ` Peter Stephenson
  2002-09-25 13:27   ` Nadav Har'El
@ 2002-09-25 17:29   ` Oliver Kiddle
  2002-09-25 17:50     ` Peter Stephenson
  1 sibling, 1 reply; 10+ messages in thread
From: Oliver Kiddle @ 2002-09-25 17:29 UTC (permalink / raw)
  To: Peter Stephenson; +Cc: Zsh hackers list

On 25 Sep, Peter Stephenson wrote:
> Borzenkov Andrey wrote:
> > Just to make it clear. Is the aim to use UTF-8 internally or to support
> > (arbitrary) multibyte encoding?
> 
> The first with as much of the second as we can get in without too much

So is your aim to use UTF-8 internally in all cases or only when it is
the selected character set? I would have thought it would be easier to
just use whatever LC_CTYPE (the locale's selected encoding) is
internally and use the mb* functions so things work regardless of
whether or not LC_CTYPE is a multi-byte character encoding. I don't
know much about other multi-byte character encodings that can be used
for the input/output locale but I had gathered they at least have the
level of compatibility with basic ASCII that allows you to use ASCII
characters in string literals. To convert everything to UTF-8
internally, you would have to either use iconv or do messy stuff: the
mb* functions deal with whatever LC_CTYPE is and not UTF-8 (unless
that's what LC_CTYPE happens to be of course).

> We are going to assume that bytes without the top-bit set are ASCII, and
> the remainder require mb* handling.

Isn't it easier to just do mb* handling on everything and not go around
checking the top bit? The mb*() functions should do that sort of stuff
for us. mbrtowc() can be used, discarding the returned wchar_t, to
consume one character of a string, for example. So it worries about
whatever the top bits of the bytes are or whatever the underlying
multi-byte character encoding requires.
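
Something along these lines, as a sketch rather than a proposed patch:
mbrtowc() is called with a NULL destination so that only the byte
length of each character is used, with no top-bit test anywhere. The
name char_offset is made up here, and stepping over one byte on an
invalid sequence is an assumption.

```c
#include <locale.h>   /* setlocale(), for the caller to select the locale */
#include <string.h>   /* strlen, memset */
#include <wchar.h>    /* mbrtowc, mbstate_t */

/* Byte offset reached after skipping nchars characters of s, or the
 * length of s if it holds fewer characters than that. */
size_t char_offset(const char *s, size_t nchars)
{
    size_t remaining = strlen(s);
    size_t offset = 0;
    mbstate_t state;

    memset(&state, 0, sizeof state);    /* initial conversion state */
    while (nchars > 0 && remaining > 0) {
        /* NULL destination: the wide character itself is discarded */
        size_t len = mbrtowc(NULL, s + offset, remaining, &state);
        if (len == (size_t)-1 || len == (size_t)-2 || len == 0) {
            memset(&state, 0, sizeof state);
            len = 1;    /* assumption: step over one byte of a bad sequence */
        }
        offset += len;
        remaining -= len;
        nchars--;
    }
    return offset;
}
```

A substring of the first n characters is then just the first
char_offset(s, n) bytes of s.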

> > Impossible. Locale names are just arbitrarily chosen strings; there is no
> > "character set code" defined in any locale definition, at least on Unix.

as has been mentioned: nl_langinfo(CODESET)

> Read the document at the link I gave which suggests otherwise.  However,
> I now think we can in any case leave this to the mb* suite to decide.

Yes, I think we can.

I'm sure you can all use google, but other possibly useful links I had
in my bookmarks are these:

  IBM's patches to various GNU stuff:
    https://www-124.ibm.com/developer/opensource/linux/patches/i18n/
  IBM article that serves as a basic intro:
    http://www-106.ibm.com/developerworks/library/l-linuni.html
  howto
    http://www.tldp.org/HOWTO/Unicode-HOWTO-6.html

Oliver

This e-mail and any attachment is for authorised use by the intended recipient(s) only.  It may contain proprietary material, confidential information and/or be subject to legal privilege.  It should not be copied, disclosed to, retained or used by, any other party.  If you are not an intended recipient then please promptly delete this e-mail and any attachment and all copies and inform the sender.  Thank you.



* Re: UTF-8 fonts
  2002-09-25 17:29   ` Oliver Kiddle
@ 2002-09-25 17:50     ` Peter Stephenson
  0 siblings, 0 replies; 10+ messages in thread
From: Peter Stephenson @ 2002-09-25 17:50 UTC (permalink / raw)
  To: Zsh hackers list

Oliver Kiddle wrote:
> So is your aim to use UTF-8 internally in all cases or only when it is
> the selected character set?

My aim is to use normal metafied characters whenever possible and not
worry how the byte stream is encoded until the few points where we don't
have a choice, and then to use system functions.

> > We are going to assume that bytes without the top-bit set are ASCII, and
> > the remainder require mb* handling.
> 
> Isn't it easier to just do mb* handling on everything and not go around
> checking the top bit? The mb*() functions should do that sort of stuff
> for us. mbrtowc() can be used, discarding the returned wchar_t, to
> consume one character of a string, for example. So it worries about
> whatever the top bits of the bytes are or whatever the underlying
> multi-byte character encoding requires.

This is where we have a choice, but I think treating even normal ASCII
bytes with arbitrary functions is going to be horrendously inefficient.
Personally I would be perfectly happy if the shell worked only with
schemes which were extensions of ASCII.

-- 
Peter Stephenson <pws@csr.com>                  Software Engineer
CSR Ltd., Science Park, Milton Road,
Cambridge, CB4 0WH, UK                          Tel: +44 (0)1223 692070



* Re: UTF-8 fonts
  2002-09-24 16:03   ` Clint Adams
@ 2002-09-24 17:41     ` Peter Stephenson
  0 siblings, 0 replies; 10+ messages in thread
From: Peter Stephenson @ 2002-09-24 17:41 UTC (permalink / raw)
  To: Zsh hackers list

Clint Adams wrote:
> The mb*/wc* functions should be much more useful.

Looks like a bare minimum implementation could be done with mbtowc() and
wcwidth(), to get zle output fixed up.  mbtowc() returns a length (as
does mblen()) so if we rely on these two working as per standard we
should be OK.  In that case we can avoid counting character lengths
ourselves, which is more portable.

-- 
Peter Stephenson <pws@csr.com>                  Software Engineer
CSR Ltd., Science Park, Milton Road,
Cambridge, CB4 0WH, UK                          Tel: +44 (0)1223 692070



* Re: UTF-8 fonts
  2002-09-24 13:39 ` Oliver Kiddle
@ 2002-09-24 16:03   ` Clint Adams
  2002-09-24 17:41     ` Peter Stephenson
  0 siblings, 1 reply; 10+ messages in thread
From: Clint Adams @ 2002-09-24 16:03 UTC (permalink / raw)
  To: Oliver Kiddle; +Cc: zsh-workers

>   http://www.ono.org/software/zsh-euc/
> Not that I've looked at it in detail - I just found it some time and
> put it in my bookmarks.

It seems to be woefully incomplete, and uses a lookup table, so it's
only good for EUC (EUC-JP?).

The mb*/wc* functions should be much more useful.



* Re: UTF-8 fonts
  2002-09-19 16:56 Peter Stephenson
  2002-09-19 18:14 ` Clint Adams
@ 2002-09-24 13:39 ` Oliver Kiddle
  2002-09-24 16:03   ` Clint Adams
  1 sibling, 1 reply; 10+ messages in thread
From: Oliver Kiddle @ 2002-09-24 13:39 UTC (permalink / raw)
  To: zsh-workers

This certainly seems to be the most requested feature these days so I
would echo what Clint said in his response.

On 19 Sep, you wrote:
> See http://www.cl.cam.ac.uk/~mgk25/unicode.html for a nice summary of
> the subject.

The info pages for glibc also have useful sections on this. I should
probably also mention the patch someone has at:
  http://www.ono.org/software/zsh-euc/
Not that I've looked at it in detail - I just found it some time and
put it in my bookmarks.
 
> My first thought about using UTF-8 instead of eight bit characters was
> that we would have to replace the current `Meta' system.  However, I
> don't think we do since the current system will seamlessly translate
> from UTF-8 input to UTF-8 output.

As far as I can see the `Meta' system should work unchanged.

There are other multibyte character sets which ideally we should be
concerned about besides just UTF-8. Does anyone know if we need to be
concerned with any which are stateful?

> Therefore, all we have to do is modify the shell's internals at the
> point where it actually compares characters --- or, more generally,
> tries to turn metafied sequences into a single character --- to use the
> normal UTF8 rules.  There may also be some extra places where counting
> the length needs changing.

Basically any dealing with string lengths, substrings or with single
characters will need modifying.

> Unicode characters are up to 6 bytes, so either with 64-bit integers we
> can do a direct comparison with some bit arithmetic, or we can just use
> strncmp.  (I don't fancy relying on internationalisation support for
> this but in principle that's probably the right thing to do.)

I'm not sure what you're trying to achieve with this 64-bit integer
suggestion. strncmp should be fine for string comparisons and I
wouldn't bet against that 6-byte maximum ever being increased.

> Hence I don't see the necessity for actually decoding UTF-8 into Unicode
> at any point, just deciding the number of bytes.  Not doing this avoids
> problems with overlong encodings (ones which illegally represent a
> character using too many bytes): an overlong encoding will always
> compare differently to the standard encoding.

I think converting things to unicode/wchar_t's would be a bad idea and
it would be better to stick to whatever LC_CTYPE is for everything
internal (except temporarily in some operations). wchar_t is
implementation defined (though ISO10646 (unicode) is fairly commonly
used for it) so it would break the portability of word code files if
they were to be stored using wchar_t's. Admittedly people could get
themselves into a mess by switching between different encodings but
that is nothing that you don't get anyway with text files. It also
makes implementing this easier because it can be done gradually, fixing
areas where the shell doesn't work for multi-byte characters whereas
using wchar_t's would need rewriting large chunks of everything at
once.

> Probably we need a configuration option to switch this on or off.

A good starting point would be a configuration option, the basic
autoconf tests and then much of the common stuff in string.c for things
like getting substrings and counting string lengths. Getting various
parts of the shell to work with multibyte character sets can then be
done piece by piece.

> Zle might be a bit more of a problem.  The web page I referred to above
> gives the hopeful message that all encoding to/decoding from UTF-8 at
> the terminal is handled by the terminal driver.  So for zle we have to
> worry about things like
> - determining whether the terminal is actually in UTF-8 mode, probably
>   from the locale

That'll be from the locale (LC_CTYPE) but I think that by using
functions like mbrlen(), you get libc to worry about that for you. 

> - how UTF-8 encoded characters interfere with meta-bindings.  May be
>   good enough simply not to use these, at least while we work out what's
>   what

I suppose that is initially a question of finding out what something
like xterm generates for meta keys when in utf-8 mode. A quick test
with cat -v reveals empty square boxes as I type and things like `M-e'
when I look at the redirected file.

> - reading multi-byte characters --- timeouts and the like

> - getting the right length for displaying, deleting, copying
>   etc. multi-byte characters.  Apart from counting continuation

note that there are multi-row characters to contend with as a separate
issue to multi-byte characters. Though we already sort of have these
with things like ^[. 

Oliver


* Re: UTF-8 fonts
  2002-09-19 16:56 Peter Stephenson
@ 2002-09-19 18:14 ` Clint Adams
  2002-09-24 13:39 ` Oliver Kiddle
  1 sibling, 0 replies; 10+ messages in thread
From: Clint Adams @ 2002-09-19 18:14 UTC (permalink / raw)
  To: Peter Stephenson; +Cc: Zsh hackers list

> Any comments?

I wholeheartedly support this; it will also make East-Asian multi-byte
characters be handled properly.



* UTF-8 fonts
@ 2002-09-19 16:56 Peter Stephenson
  2002-09-19 18:14 ` Clint Adams
  2002-09-24 13:39 ` Oliver Kiddle
  0 siblings, 2 replies; 10+ messages in thread
From: Peter Stephenson @ 2002-09-19 16:56 UTC (permalink / raw)
  To: Zsh hackers list

See http://www.cl.cam.ac.uk/~mgk25/unicode.html for a nice summary of
the subject.

My first thought about using UTF-8 instead of eight bit characters was
that we would have to replace the current `Meta' system.  However, I
don't think we do since the current system will seamlessly translate
from UTF-8 input to UTF-8 output.

Therefore, all we have to do is modify the shell's internals at the
point where it actually compares characters --- or, more generally,
tries to turn metafied sequences into a single character --- to use the
normal UTF8 rules.  There may also be some extra places where counting
the length needs changing.

Unicode characters are up to 6 bytes, so either with 64-bit integers we
can do a direct comparison with some bit arithmetic, or we can just use
strncmp.  (I don't fancy relying on internationalisation support for
this but in principle that's probably the right thing to do.)
Hence I don't see the necessity for actually decoding UTF-8 into Unicode
at any point, just deciding the number of bytes.  Not doing this avoids
problems with overlong encodings (ones which illegally represent a
character using too many bytes): an overlong encoding will always
compare differently to the standard encoding.
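
To make the idea concrete, here is a small sketch (hypothetical names,
not zsh code) that decides the number of bytes from the lead byte alone
and compares two characters bytewise without decoding them. It accepts
the 5- and 6-byte forms of the original UTF-8 definition mentioned
above; RFC 3629 later limited UTF-8 to 4 bytes.

```c
#include <stddef.h>   /* size_t */
#include <string.h>   /* strncmp */

/* Length in bytes of a UTF-8 sequence, judged from its first byte.
 * Returns 0 for a continuation byte (10xxxxxx) or an invalid lead byte. */
size_t utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80)           return 1;   /* 0xxxxxxx: ASCII */
    if ((lead & 0xE0) == 0xC0) return 2;   /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;   /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4;   /* 11110xxx */
    if ((lead & 0xFC) == 0xF8) return 5;   /* 111110xx */
    if ((lead & 0xFE) == 0xFC) return 6;   /* 1111110x */
    return 0;
}

/* Nonzero if the first character of a equals the first character of b,
 * comparing the encoded bytes directly. */
int utf8_same_char(const char *a, const char *b)
{
    size_t la = utf8_seq_len((unsigned char)a[0]);
    size_t lb = utf8_seq_len((unsigned char)b[0]);

    return la != 0 && la == lb && strncmp(a, b, la) == 0;
}
```

The overlong two-byte form of "/" (bytes 0xC0 0xAF) has a different
length from the one-byte standard form, so utf8_same_char reports the
two as different, which is the property described above.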

Probably we need a configuration option to switch this on or off.

Zle might be a bit more of a problem.  The web page I referred to above
gives the hopeful message that all encoding to/decoding from UTF-8 at
the terminal is handled by the terminal driver.  So for zle we have to
worry about things like
- determining whether the terminal is actually in UTF-8 mode, probably
  from the locale
- how UTF-8 encoded characters interfere with meta-bindings.  May be
  good enough simply not to use these, at least while we work out what's
  what
- reading multi-byte characters --- timeouts and the like
- getting the right length for displaying, deleting, copying
  etc. multi-byte characters.  Apart from counting continuation
  bytes, we may be stuck with using wcwidth for display.  This is a pain
  because it involves explicit wchar_t's, and I have no experience at
  all with these (except that they mess up compilation of otherwise trivial
  string-handling functions).
- all the stuff I've forgotten.

Any comments?

-- 
Peter Stephenson <pws@csr.com>                  Software Engineer
CSR Ltd., Science Park, Milton Road,
Cambridge, CB4 0WH, UK                          Tel: +44 (0)1223 692070


