From: Borzenkov Andrey <Andrej.Borsenkow@mow.siemens.ru>
To: "'Zsh hackers list'" <zsh-workers@sunsite.dk>
Subject: RE: UTF-8 fonts
Date: Wed, 25 Sep 2002 15:11:39 +0400
Message-ID: <6134254DE87BD411908B00A0C99B044F042E3E33@mowd019a.mow.siemens.ru>
Just to be clear: is the aim to use UTF-8 internally, or to support
(arbitrary) multibyte encodings?
> See http://www.cl.cam.ac.uk/~mgk25/unicode.html for a nice summary of
> the subject.
>
> My first thought about using UTF-8 instead of eight bit characters
This sounds like you want to convert input to UTF-8 internally?
> was
> that we would have to replace the current `Meta' system. However, I
> don't think we do since the current system will seamlessly translate
> from UTF-8 input to UTF-8 output.
>
> Therefore, all we have to do is modify the shell's internals at the
> point where it actually compares characters --- or, more generally,
> tries to turn metafied sequences into a single character --- to use the
> normal UTF8 rules. There may also be some extra places where counting
> the length needs changing.
>
You also need to modify every place where the shell compares or translates
(upper <-> lower) characters. This is by definition locale dependent: collating
order differs between languages even when they use the same character set.
This means you can use UTF-8 (or, more generally, any multibyte encoding) only
if your current locale supports it, which in effect means using the wc* and
mb* function suites anyway.
But this also means you cannot assume anything about the current character set,
and cannot assume that it is transparent w.r.t. the current string handling in
zsh.
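To illustrate the point about case translation, here is a minimal sketch of what going through the wc*/mb* suite could look like. It is not zsh code: `mb_toupper` is a hypothetical helper name, and the sketch assumes a POSIX libc in which `mbstowcs()` with a NULL destination only counts characters.

```c
#include <stdlib.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical helper (not zsh code): upper-case a multibyte string using
 * whatever encoding and case rules the current LC_CTYPE locale defines.
 * Returns 0 on success, -1 if the input is not valid in the current
 * locale, memory runs out, or the result does not fit in `out`.
 */
int mb_toupper(const char *in, char *out, size_t outsz)
{
    /* POSIX: a NULL destination makes mbstowcs() only count characters. */
    size_t n = mbstowcs(NULL, in, 0);
    if (n == (size_t)-1)
        return -1;

    wchar_t *w = malloc((n + 1) * sizeof *w);
    if (w == NULL)
        return -1;
    if (mbstowcs(w, in, n + 1) == (size_t)-1) {
        free(w);
        return -1;
    }

    /* Case mapping happens on wide characters, never on raw bytes. */
    for (size_t i = 0; i < n; i++)
        w[i] = (wchar_t)towupper((wint_t)w[i]);

    /* Convert back; wcstombs() returns the byte count without the NUL. */
    size_t r = wcstombs(out, w, outsz);
    free(w);
    if (r == (size_t)-1 || r >= outsz)
        return -1;
    return 0;
}
```

The caller is assumed to have run `setlocale(LC_ALL, "")` once at startup; without that the program stays in the "C" locale and only ASCII is handled. Nothing in the function names an encoding, which is the point being made above.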
> Unicode characters are up to 6 bytes, so either with 64-bit integers we
> can do a direct comparison with some bit arithmetic, or we can just use
> strncmp. (I don't fancy relying on internationalisation support for
> this but in principle that's probably the right thing to do.)
> Hence I don't see the necessity for actually decoding UTF-8 into Unicode
> at any point, just deciding the number of bytes. Not doing this avoids
> problems with overlong encodings (ones which illegally represent a
> character using too many bytes): an overlong encoding will always
> compare differently to the standard encoding.
>
How do you know your input (and the strings you are processing) are UTF-8?
Besides, the standards do not provide a way to input a multibyte character -
you can only read wide characters.
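As a rough sketch of reading wide characters without assuming any particular encoding, input bytes can be fed one at a time to ISO C `mbrtowc()`, which keeps partial sequences in an `mbstate_t`. The `mb_reader` struct and function names below are hypothetical, chosen only for illustration; on a stdio stream, `fgetwc()` would be the more direct interface.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical decoder state; zero it with mb_reader_init() before use. */
struct mb_reader {
    mbstate_t st;
};

void mb_reader_init(struct mb_reader *r)
{
    memset(&r->st, 0, sizeof r->st);
}

/*
 * Feed one input byte.
 * Returns 1 and stores a character in *out when one is complete,
 *         0 when more bytes are needed,
 *        -1 when the byte is invalid in the current locale's encoding
 *           (the state is reset so decoding can resume).
 */
int mb_reader_feed(struct mb_reader *r, unsigned char byte, wchar_t *out)
{
    char c = (char)byte;
    size_t n = mbrtowc(out, &c, 1, &r->st);

    if (n == (size_t)-2)        /* incomplete sequence so far */
        return 0;
    if (n == (size_t)-1) {      /* invalid sequence */
        mb_reader_init(r);
        return -1;
    }
    return 1;                   /* n is 0 (a NUL character) or 1 */
}
```

A return value of 0 is where a caller would decide how long to wait for the rest of the sequence. Whether the bytes are UTF-8 or something else is left entirely to the current locale.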
> Probably we need a configuration option to switch this on or off.
>
Yes, either we rely on standard locale support (and do not care what
character set is being used) or we must provide some out-of-band means to
define the character set in use.
> Zle might be a bit more of a problem. The web page I referred to above
> gives the hopeful message that all encoding to/decoding from UTF-8 at
> the terminal is handled by the terminal driver. So for zle we have to
> worry about things like
> - determining whether the terminal is actually in UTF-8 mode, probably
> from the locale
Impossible. Locale names are just arbitrarily chosen strings; there is no
"character set code" defined in any locale definition, at least on Unix.
> - how UTF-8 encoded characters interfere with meta-bindings. May be
> good enough simply not to use these, at least while we work out what's
> what
> - reading multi-byte characters --- timeouts and the like
Use standard OS interfaces to read wide characters.
> - getting the right length for displaying, deleting, copying
> etc. multi-byte characters. Apart from counting continuation
> bytes, we may be stuck with using wcwidth for display. This is a pain
> because it involves explicit wchar_t's, and I have no experience at
> all with these (except that they mess up compilation of otherwise
> trivial string-handling functions).
> - all the stuff I've forgotten.
>
> Any comments?
>
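For reference, the byte-level handling proposed in the quoted text ("just deciding the number of bytes", "counting continuation bytes") could be sketched as below. Both function names are hypothetical, and the sketch simply assumes the bytes really are UTF-8, which is exactly the assumption questioned above.

```c
#include <stddef.h>

/*
 * Length in bytes of a UTF-8 sequence, judged from its first byte alone.
 * Follows the original 1-to-6-byte scheme that the quoted text assumes;
 * RFC 3629 later restricted UTF-8 to 4 bytes. Returns 0 for a byte that
 * cannot start a sequence (a continuation byte, 0xFE or 0xFF).
 */
int utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80) return 1;   /* 0xxxxxxx */
    if (lead < 0xC0) return 0;   /* 10xxxxxx: continuation byte */
    if (lead < 0xE0) return 2;   /* 110xxxxx */
    if (lead < 0xF0) return 3;   /* 1110xxxx */
    if (lead < 0xF8) return 4;   /* 11110xxx */
    if (lead < 0xFC) return 5;   /* 111110xx */
    if (lead < 0xFE) return 6;   /* 1111110x */
    return 0;
}

/* Count characters by skipping continuation bytes (10xxxxxx). */
size_t utf8_char_count(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}
```

This counts characters, not screen columns, so it does not replace wcwidth for display. Neither function validates its input: an overlong form such as the two bytes 0xC0 0xAF is counted as one character and, as the quoted text notes, compares unequal to a plain "/".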
-andrey