zsh-workers
* UTF-8 fonts
@ 2002-09-19 16:56 Peter Stephenson
  2002-09-19 18:14 ` Clint Adams
  2002-09-24 13:39 ` Oliver Kiddle
  0 siblings, 2 replies; 10+ messages in thread
From: Peter Stephenson @ 2002-09-19 16:56 UTC
  To: Zsh hackers list

See http://www.cl.cam.ac.uk/~mgk25/unicode.html for a nice summary of
the subject.

My first thought about using UTF-8 instead of eight bit characters was
that we would have to replace the current `Meta' system.  However, I
don't think we do since the current system will seamlessly translate
from UTF-8 input to UTF-8 output.

Therefore, all we have to do is modify the shell's internals at the
point where it actually compares characters --- or, more generally,
tries to turn metafied sequences into a single character --- to use the
normal UTF8 rules.  There may also be some extra places where counting
the length needs changing.

Unicode characters are up to 6 bytes, so either with 64-bit integers we
can do a direct comparison with some bit arithmetic, or we can just use
strncmp.  (I don't fancy relying on internationalisation support for
this but in principle that's probably the right thing to do.)
Hence I don't see the necessity for actually decoding UTF-8 into Unicode
at any point, just deciding the number of bytes.  Not doing this avoids
problems with overlong encodings (ones which illegally represent a
character using too many bytes): an overlong encoding will always
compare differently to the standard encoding.
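
A minimal sketch of the "decide the number of bytes, then compare" idea
from the paragraph above.  The function names utf8_seq_len and
utf8_same_char are hypothetical and do not come from the zsh source, and
the sketch works on plain (unmetafied) NUL-terminated strings.  It follows
the message in allowing sequences of up to 6 bytes; RFC 3629 later
restricted UTF-8 to 4 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Number of bytes in the UTF-8 sequence introduced by lead byte c.
 * Returns 0 for a continuation byte (10xxxxxx) and for 0xFE/0xFF,
 * neither of which can start a sequence.  No validation of the
 * following bytes is attempted. */
size_t utf8_seq_len(unsigned char c)
{
    if (c < 0x80) return 1;   /* 0xxxxxxx: ASCII */
    if (c < 0xC0) return 0;   /* 10xxxxxx: continuation byte */
    if (c < 0xE0) return 2;   /* 110xxxxx */
    if (c < 0xF0) return 3;   /* 1110xxxx */
    if (c < 0xF8) return 4;   /* 11110xxx */
    if (c < 0xFC) return 5;   /* 111110xx (pre-RFC 3629 only) */
    if (c < 0xFE) return 6;   /* 1111110x (pre-RFC 3629 only) */
    return 0;
}

/* Compare the first character of a and b byte for byte, without
 * decoding to a code point.  An overlong encoding has a different
 * length or different bytes from the standard encoding, so it never
 * compares equal to it. */
int utf8_same_char(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    size_t la = utf8_seq_len((unsigned char)a[0]);
    size_t lb = utf8_seq_len((unsigned char)b[0]);
    if (la == 0 || la != lb)
        return 0;
    /* strncmp stops at a NUL, so a truncated sequence is not overrun. */
    return strncmp(a, b, la) == 0;
}
```

For example, comparing "/" with the overlong two-byte form 0xC0 0xAF
reports a mismatch, because the sequence lengths differ.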

Probably we need a configuration option to switch this on or off.

Zle might be a bit more of a problem.  The web page I referred to above
gives the hopeful message that all encoding to/decoding from UTF-8 at
the terminal is handled by the terminal driver.  So for zle we have to
worry about things like
- determining whether the terminal is actually in UTF-8 mode, probably
  from the locale
- how UTF-8 encoded characters interfere with meta-bindings.  May be
  good enough simply not to use these, at least while we work out what's
  what
- reading multi-byte characters --- timeouts and the like
- getting the right length for displaying, deleting, copying
  etc. multi-byte characters.  Apart from counting continuation
  bytes, we may be stuck with using wcwidth for display.  This is a pain
  because it involves explicit wchar_t's, and I have no experience at
  all with these (except that they mess up compilation of otherwise trivial
  string-handling functions).
- all the stuff I've forgotten.
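
Two of the items above can be sketched in a few lines of C.  This is only
an illustration under stated assumptions, not anything taken from zle:
the function names are hypothetical, the locale check assumes the
platform provides nl_langinfo(CODESET) from <langinfo.h>, and the exact
spelling of the returned codeset name varies between systems.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>

/* Does this codeset name denote UTF-8?  The accepted spellings are an
 * assumption; systems differ in what nl_langinfo(CODESET) reports. */
int codeset_is_utf8(const char *codeset)
{
    assert(codeset != NULL);
    return strcmp(codeset, "UTF-8") == 0 || strcmp(codeset, "utf-8") == 0
        || strcmp(codeset, "UTF8") == 0 || strcmp(codeset, "utf8") == 0;
}

/* Guess from the environment's locale whether the terminal expects
 * UTF-8.  This reflects what the locale claims, not what the terminal
 * actually does. */
int locale_is_utf8(void)
{
    if (setlocale(LC_CTYPE, "") == NULL)
        return 0;
    return codeset_is_utf8(nl_langinfo(CODESET));
}

/* Count characters in a UTF-8 string by skipping continuation bytes
 * (those of the form 10xxxxxx).  This gives a character count, not a
 * column count: wide and combining characters still need something
 * like wcwidth. */
size_t utf8_count_chars(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}
```

Counting this way is enough for moving and deleting by character, but the
display width still has to come from elsewhere, as the item above notes.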

Any comments?

-- 
Peter Stephenson <pws@csr.com>                  Software Engineer
CSR Ltd., Science Park, Milton Road,
Cambridge, CB4 0WH, UK                          Tel: +44 (0)1223 692070


* RE: UTF-8 fonts
@ 2002-09-25 11:11 Borzenkov Andrey
  2002-09-25 11:36 ` Peter Stephenson
  0 siblings, 1 reply; 10+ messages in thread
From: Borzenkov Andrey @ 2002-09-25 11:11 UTC
  To: 'Zsh hackers list'

Just to make it clear. Is the aim to use UTF-8 internally or to support
(arbitrary) multibyte encoding?

> See http://www.cl.cam.ac.uk/~mgk25/unicode.html for a nice summary of
> the subject.
> 
> My first thought about using UTF-8 instead of eight bit characters

This sounds like you want to convert input to UTF-8 internally?

> was
> that we would have to replace the current `Meta' system.  However, I
> don't think we do since the current system will seamlessly translate
> from UTF-8 input to UTF-8 output.
> 
> Therefore, all we have to do is modify the shell's internals at the
> point where it actually compares characters --- or, more generally,
> tries to turn metafied sequences into a single character --- to use the
> normal UTF8 rules.  There may also be some extra places where counting
> the length needs changing.
> 

You also need to modify any place where the shell compares or translates
(upper <-> lower) characters. This is by definition locale dependent: the
collating order differs between languages even when they use the same
character set. This means you can use UTF-8 (or, more generally, any
multibyte encoding) only if your current locale supports it, which in effect
means using the wc* and mb* function suites anyway.
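
As a rough illustration of what relying on the wc* and mb* functions looks
like, the sketch below upper-cases a multibyte string through the current
LC_CTYPE locale and compares two strings in collating order.  The names
mb_toupper and mb_collate are hypothetical; only the standard C library
calls (mbstowcs, towupper, wcstombs, strcoll) are real.  The result
depends entirely on which locale the caller has selected with setlocale.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Upper-case the multibyte string in according to the current locale,
 * writing the result to out (outsize bytes available).
 * Returns 0 on success, -1 on an invalid sequence, allocation failure,
 * or if out is too small. */
int mb_toupper(const char *in, char *out, size_t outsize)
{
    assert(in != NULL && out != NULL);
    /* A string never has more characters than bytes. */
    size_t cap = strlen(in) + 1;
    wchar_t *w = malloc(cap * sizeof *w);
    if (w == NULL)
        return -1;

    size_t n = mbstowcs(w, in, cap);      /* multibyte -> wide */
    if (n == (size_t)-1) {
        free(w);
        return -1;
    }
    for (size_t i = 0; i < n; i++)
        w[i] = (wchar_t)towupper((wint_t)w[i]);

    size_t m = wcstombs(out, w, outsize); /* wide -> multibyte */
    free(w);
    /* m == outsize means the output was not NUL-terminated. */
    if (m == (size_t)-1 || m >= outsize)
        return -1;
    return 0;
}

/* Compare two multibyte strings in the locale's collating order. */
int mb_collate(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    return strcoll(a, b);
}
```

In the default "C" locale this only handles ASCII; a caller would first
call setlocale(LC_ALL, "") to pick up the user's locale.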

But this also means you cannot assume anything about the current character
set, and cannot assume that it is transparent w.r.t. the current string
handling in zsh.

> Unicode characters are up to 6 bytes, so either with 64-bit integers we
> can do a direct comparison with some bit arithmetic, or we can just use
> strncmp.  (I don't fancy relying on internationalisation support for
> this but in principle that's probably the right thing to do.)
> Hence I don't see the necessity for actually decoding UTF-8 into Unicode
> at any point, just deciding the number of bytes.  Not doing this avoids
> problems with overlong encodings (ones which illegally represent a
> character using too many bytes): an overlong encoding will always
> compare differently to the standard encoding.
> 

How do you know your input (and the strings you are processing) are UTF-8?
Besides, the standards do not provide a way to input a multibyte character -
you can only read a wide character.

> Probably we need a configuration option to switch this on or off.
> 

Yes, either we rely on standard locale support (and do not care what
character set is being used) or we must provide some OOB means to define
the character set in use.

> Zle might be a bit more of a problem.  The web page I referred to above
> gives the hopeful message that all encoding to/decoding from UTF-8 at
> the terminal is handled by the terminal driver.  So for zle we have to
> worry about things like
> - determining whether the terminal is actually in UTF-8 mode, probably
>   from the locale

Impossible. Locale names are just arbitrarily chosen strings; there is no
"character set code" defined in any locale definition, at least on Unix.

> - how UTF-8 encoded characters interfere with meta-bindings.  May be
>   good enough simply not to use these, at least while we work out what's
>   what
> - reading multi-byte characters --- timeouts and the like

Use standard OS interfaces to read wide characters.

> - getting the right length for displaying, deleting, copying
>   etc. multi-byte characters.  Apart from counting continuation
>   bytes, we may be stuck with using wcwidth for display.  This is a pain
>   because it involves explicit wchar_t's, and I have no experience at
>   all with these (except that they mess up compilation of otherwise
>   trivial string-handling functions).
> - all the stuff I've forgotten.
> 
> Any comments?
> 

-andrey


Thread overview: 10+ messages
2002-09-19 16:56 UTF-8 fonts Peter Stephenson
2002-09-19 18:14 ` Clint Adams
2002-09-24 13:39 ` Oliver Kiddle
2002-09-24 16:03   ` Clint Adams
2002-09-24 17:41     ` Peter Stephenson
2002-09-25 11:11 Borzenkov Andrey
2002-09-25 11:36 ` Peter Stephenson
2002-09-25 13:27   ` Nadav Har'El
2002-09-25 17:29   ` Oliver Kiddle
2002-09-25 17:50     ` Peter Stephenson
