zsh-workers
From: Borzenkov Andrey <Andrej.Borsenkow@mow.siemens.ru>
To: "'Zsh hackers list'" <zsh-workers@sunsite.dk>
Subject: RE: UTF-8 fonts
Date: Wed, 25 Sep 2002 15:11:39 +0400
Message-ID: <6134254DE87BD411908B00A0C99B044F042E3E33@mowd019a.mow.siemens.ru>

Just to make it clear: is the aim to use UTF-8 internally, or to support
(arbitrary) multibyte encodings?

> See http://www.cl.cam.ac.uk/~mgk25/unicode.html for a nice summary of
> the subject.
> 
> My first thought about using UTF-8 instead of eight bit characters

This sounds like you want to convert the input to UTF-8 internally?

> was
> that we would have to replace the current `Meta' system.  However, I
> don't think we do since the current system will seamlessly translate
> from UTF-8 input to UTF-8 output.
> 
> Therefore, all we have to do is modify the shell's internals at the
> point where it actually compares characters --- or, more generally,
> tries to turn metafied sequences into a single character --- to use the
> normal UTF8 rules.  There may also be some extra places where counting
> the length needs changing.
> 

You also need to modify any place where the shell compares or translates
(upper <-> lower) characters. This is by definition locale dependent:
collating order differs between languages even when they use the same
character set. That means you can use UTF-8 (or, more generally, any
multibyte encoding) only if your current locale supports it, which in effect
means using the wc* and mb* function suites anyway.

But this also means you cannot assume anything about the current character
set, and cannot assume that it is transparent w.r.t. the current string
handling in zsh.
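
To illustrate the collation point, here is a minimal sketch in Python (used
only for illustration; zsh itself is C). It sorts the same words under a named
locale through Python's locale module, which wraps the C library's collation
routines. The helper name sort_in_locale is made up for this example, and
which locale names are installed is system dependent.

```python
import locale

def sort_in_locale(words, locale_name):
    """Sort words using the collation rules of locale_name.

    Hypothetical helper for illustration. Raises locale.Error if the
    named locale is not installed on this system.
    """
    saved = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
        # strxfrm maps each string to a key that sorts in locale order
        return sorted(words, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)  # restore the old setting

if __name__ == "__main__":
    print(sort_in_locale(["b", "a", "B"], "C"))
```

In the "C" locale the order follows the character codes, so upper case sorts
before lower case. Under a language-specific locale the same list will
generally come out in a different order, which is why the comparison code
cannot be written independently of the locale.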

> Unicode characters are up to 6 bytes, so either with 64-bit integers we
> can do a direct comparison with some bit arithmetic, or we can just use
> strncmp.  (I don't fancy relying on internationalisation support for
> this but in principle that's probably the right thing to do.)
> Hence I don't see the necessity for actually decoding UTF-8 into Unicode
> at any point, just deciding the number of bytes.  Not doing this avoids
> problems with overlong encodings (ones which illegally represent a
> character using too many bytes): an overlong encoding will always
> compare differently to the standard encoding.
> 
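
A rough sketch of what the quoted paragraph describes, in Python for brevity:
derive the sequence length from the lead byte alone and compare the raw bytes
without decoding. It assumes the input is already known to be UTF-8, which is
exactly the assumption questioned below. The function names are made up for
this example. The 6-byte limit follows the text above; RFC 3629 later
restricted UTF-8 to 4 bytes.

```python
def utf8_seq_len(lead: int) -> int:
    """Sequence length implied by a UTF-8 lead byte, or 0 if it cannot lead."""
    if lead < 0x80:
        return 1                      # plain ASCII
    if 0xC0 <= lead < 0xE0:
        return 2
    if 0xE0 <= lead < 0xF0:
        return 3
    if 0xF0 <= lead < 0xF8:
        return 4
    if 0xF8 <= lead < 0xFC:
        return 5                      # 5 and 6 byte forms: original scheme only
    if 0xFC <= lead < 0xFE:
        return 6
    return 0                          # continuation byte (0x80-0xBF), 0xFE or 0xFF

def same_char(a: bytes, b: bytes) -> bool:
    """Compare the first character of a and b bytewise, strncmp style."""
    if not a or not b:
        return False
    n = utf8_seq_len(a[0])
    if n == 0 or n != utf8_seq_len(b[0]) or len(a) < n or len(b) < n:
        return False
    return a[:n] == b[:n]

if __name__ == "__main__":
    # b"\xc0\xaf" is an overlong 2-byte encoding of "/" (0x2F)
    print(same_char(b"/", b"\xc0\xaf"))
```

The overlong form has a different lead byte and length from the standard
one-byte "/", so the bytewise comparison reports them as different.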

How do you know your input (and the strings you are processing) are UTF-8?
Besides, the standards do not provide a way to input a multibyte character;
you can only read wide characters.
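
As an illustration of reading whole characters rather than bytes, here is a
small Python sketch. In C the corresponding tools would be fgetwc() or
mbrtowc() under the current locale; the Python analogue used here is an
incremental decoder, which is a substitute for illustration, not what zsh
does. The function name read_chars is made up, and the encoding is passed in
explicitly as an assumption (in practice it would have to come from the
locale).

```python
import codecs

def read_chars(byte_stream, encoding):
    """Yield one character at a time from an iterable of byte values.

    Bytes are fed to the decoder singly, as they might arrive from a
    terminal; nothing is yielded until a character is complete.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    for b in byte_stream:
        ch = decoder.decode(bytes([b]))  # "" while the sequence is incomplete
        if ch:
            yield ch

if __name__ == "__main__":
    data = "a\u00e9\u20ac".encode("utf-8")   # 1 + 2 + 3 bytes
    print(len(data), len(list(read_chars(data, "utf-8"))))
```

The caller never sees a partial multibyte sequence, and the same loop works
for any encoding the decoder knows about, not only UTF-8.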

> Probably we need a configuration option to switch this on or off.
> 

Yes, either we rely on standard locale support (and do not care what
character set is being used) or we must provide some out-of-band means to
define the character set in use.

> Zle might be a bit more of a problem.  The web page I referred to above
> gives the hopeful message that all encoding to/decoding from UTF-8 at
> the terminal is handled by the terminal driver.  So for zle we have to
> worry about things like
> - determining whether the terminal is actually in UTF-8 mode, probably
>   from the locale

Impossible. Locale names are just arbitrarily chosen strings; there is no
"character set code" defined in any locale definition, at least on Unix.

> - how UTF-8 encoded characters interfere with meta-bindings.  May be
>   good enough simply not to use these, at least while we work out what's
>   what
> - reading multi-byte characters --- timeouts and the like

Use the standard OS interfaces to read wide characters.

> - getting the right length for displaying, deleting, copying
>   etc. multi-byte characters.  Apart from counting continuation
>   bytes, we may be stuck with using wcwidth for display.  This is a pain
>   because it involves explicit wchar_t's, and I have no experience at
>   all with these (except that they mess up compilation of otherwise
>   trivial string-handling functions).
> - all the stuff I've forgotten.
> 
> Any comments?
> 
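
On the length bullet in the list above, a minimal Python sketch of the two
measures involved: the number of characters (found by skipping continuation
bytes) and the number of terminal columns. The column count here is only an
approximation of what wcwidth() reports, built on Python's unicodedata module;
both function names are made up for this example.

```python
import unicodedata

def count_chars(data: bytes) -> int:
    """Count characters in UTF-8 data by skipping continuation bytes (10xxxxxx)."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)

def display_width(text: str) -> int:
    """Approximate the number of terminal columns text occupies.

    Combining marks take no column, East Asian wide and fullwidth
    characters take two, everything else is counted as one. Control
    characters are not treated specially, unlike wcwidth().
    """
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width

if __name__ == "__main__":
    s = "\u65e5\u672c"                      # two CJK ideographs
    print(len(s.encode("utf-8")), count_chars(s.encode("utf-8")), display_width(s))
```

Byte length, character count and column count are three different numbers
for the same string, so cursor movement and deletion in zle would need the
second and redisplay would need the third.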

-andrey

