zsh-workers
From: Oliver Kiddle <okiddle@yahoo.co.uk>
To: Zsh-workers <zsh-workers@sunsite.dk>
Subject: Re: UTF-8 support
Date: Tue, 05 Oct 2004 13:01:32 +0200
Message-ID: <29214.1096974092@trentino.logica.co.uk>
In-Reply-To: <200410041620.i94GKNro006000@news01.csr.com>

Peter wrote:
> I came to the conclusion that was going to be very time consuming --- it
> means unmetafying potentially a long string (we don't know where the
> characters end) and calling a function every time we want to compare multibyte
> characters.  Doing it only for UTF-8 can be optimised to work with
> extensions to the current tests; it's simple to test for the length of a
> UTF-8 character (although some error checking is also necessary).

If you want to find a short string in a long string you can surely
metafy the short string instead of unmetafying the long string.
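As a rough illustration of searching without unmetafying the long string, here is a minimal sketch in C. It assumes a simplified escaping rule (NUL and bytes 0x83..0xA2 are stored as a Meta byte followed by the byte XOR 32); the names META, needs_meta, metafy_buf and find_metafied are hypothetical and are not the shell's actual routines. The search steps over escape pairs so that a match can only start on a character boundary; a plain strstr could otherwise match the second byte of a pair.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed, simplified escaping rule for this sketch: NUL and bytes
 * 0x83..0xA2 are stored as META followed by (byte ^ 32). The real set of
 * special bytes in the shell may differ. */
#define META 0x83

static int needs_meta(unsigned char c)
{
    return c == 0 || (c >= 0x83 && c <= 0xA2);
}

/* Metafy len raw bytes into out, which must hold at least 2*len + 1 bytes.
 * Returns the metafied length, excluding the terminating NUL. */
static size_t metafy_buf(const unsigned char *in, size_t len,
                         unsigned char *out)
{
    size_t n = 0;

    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < len; i++) {
        if (needs_meta(in[i])) {
            out[n++] = META;
            out[n++] = (unsigned char)(in[i] ^ 32);
        } else {
            out[n++] = in[i];
        }
    }
    out[n] = '\0';
    return n;
}

/* Find an already-metafied needle (nlen bytes) in a metafied, NUL-terminated
 * haystack. Only positions that start a character are tried, so the second
 * byte of an escape pair is never mistaken for a literal byte. */
static const unsigned char *find_metafied(const unsigned char *hay,
                                          const unsigned char *needle,
                                          size_t nlen)
{
    if (nlen == 0)
        return hay;
    for (const unsigned char *p = hay; *p;
         p += (*p == META && p[1] != '\0') ? 2 : 1) {
        /* strncmp stops at the haystack's terminator, so it never reads
         * past the end; metafied data contains no embedded NUL. */
        if (strncmp((const char *)p, (const char *)needle, nlen) == 0)
            return p;
    }
    return NULL;
}
```

Only the short needle is converted, so the cost of the conversion is proportional to the needle rather than to the long string.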

The approach I was suggesting has the big advantage that we can add
support in isolated areas without first breaking the entire shell.

I think it would be a bad mistake to write our own UTF-8-specific
versions of all the routines that libc already provides, even if we can
make one or two slightly more efficient by handling the meta process at
the same time. And if we're going to restrict the code to UTF-8, we
could ditch the meta stuff and use overcoding. This amounts to storing
the null character as the overlong two-byte sequence 0xc0 0x80. The code
for that would be a lot simpler, but you can't expect to pass overlong
sequences elsewhere without getting errors. At least UTF-8 allows you to
strchr for 7-bit ASCII characters in a UTF-8 string (other multi-byte
encodings allow this only for /). Can we perhaps change the Meta
character to 0xc0? We could then use overcoding for UTF-8 while keeping
the UTF-8-specific code in the metafy process to a minimum.
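To make the overlong-NUL idea concrete, here is a small sketch in C. It assumes the input is valid UTF-8 apart from embedded NUL bytes; since the byte 0xC0 never occurs in valid UTF-8, the pair 0xC0 0x80 is unambiguous. The function names overlong_encode and overlong_decode are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Encode len raw bytes, storing each NUL as the overlong pair 0xC0 0x80 so
 * the result is an ordinary NUL-terminated C string. out must hold at least
 * 2*len + 1 bytes. Returns the encoded length, excluding the terminator.
 * Assumption: the input is valid UTF-8 apart from embedded NUL bytes. */
static size_t overlong_encode(const unsigned char *in, size_t len,
                              unsigned char *out)
{
    size_t n = 0;

    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < len; i++) {
        if (in[i] == 0) {
            out[n++] = 0xC0;
            out[n++] = 0x80;
        } else {
            out[n++] = in[i];
        }
    }
    out[n] = '\0';
    return n;
}

/* Reverse the encoding before handing data to anything that would reject
 * overlong sequences. out must hold at least strlen(in) bytes. Returns the
 * raw length. */
static size_t overlong_decode(const unsigned char *in, unsigned char *out)
{
    size_t n = 0;

    assert(in != NULL && out != NULL);
    while (*in != '\0') {
        if (in[0] == 0xC0 && in[1] == 0x80) {
            out[n++] = 0;
            in += 2;
        } else {
            out[n++] = *in++;
        }
    }
    return n;
}
```

Because every byte of a multi-byte UTF-8 sequence has the high bit set, strchr for a 7-bit ASCII character such as '/' on the encoded buffer still lands only on real occurrences of that character.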

The most efficient way would be to maintain string lengths, Pascal style
(length in bytes, not characters), possibly even using wchar_t instead
of multi-byte encodings. We could perhaps do that for limited sections
of code such as parameters. That would cope better when someone decides
to change the current locale. If we extend that elsewhere, we need to
be careful if we want to maintain portability of word code files, however.
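A counted string of the kind described might look like the following sketch in C. The struct cstr type and its helpers are hypothetical, for illustration only. With an explicit byte length, embedded NUL bytes need no escaping at all, and comparison and search become memcmp and memchr.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical counted-string type: the length is in bytes, not characters,
 * so the contents may include NUL without any escaping. */
struct cstr {
    size_t len;
    unsigned char *bytes;
};

/* Copy len bytes of data into a new counted string. On allocation failure
 * the result has bytes == NULL and len == 0. */
static struct cstr cstr_new(const void *data, size_t len)
{
    struct cstr s = { 0, NULL };

    assert(data != NULL || len == 0);
    s.bytes = malloc(len ? len : 1);
    if (s.bytes != NULL) {
        if (len > 0)
            memcpy(s.bytes, data, len);
        s.len = len;
    }
    return s;
}

/* Byte-wise equality; no scan for a terminator is needed. */
static int cstr_equal(struct cstr a, struct cstr b)
{
    return a.len == b.len && (a.len == 0 || memcmp(a.bytes, b.bytes, a.len) == 0);
}

/* Offset of the first occurrence of byte c, or -1 if it does not occur. */
static ptrdiff_t cstr_find_byte(struct cstr s, unsigned char c)
{
    const unsigned char *p = s.len ? memchr(s.bytes, c, s.len) : NULL;

    return p ? p - s.bytes : -1;
}

static void cstr_free(struct cstr *s)
{
    free(s->bytes);
    s->bytes = NULL;
    s->len = 0;
}
```

The trade-off is that every interface which currently passes a bare char pointer would have to carry the length as well, which is why confining the approach to a limited area such as parameters is attractive.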

> Given that the whole point of Unicode is to replace all other schemes,
> I'm not so keen about supporting other schemes if it's that much less
> efficient.

I'm not suggesting supporting alternatives to Unicode but alternatives
to UTF-8. I'd bet that single-byte 8-bit encodings will stick around on
small or embedded systems for longer than you might expect. My main
objection is to any suggestion of not using library calls to handle the
work. mblen may be easy to reimplement but wcwidth is not, so we'd end up
with a mixture.
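For reference, leaving the work to the library could look roughly like this sketch in C, which uses the standard mbrtowc and the POSIX/XSI wcwidth to compute the display width of a multibyte string in whatever encoding the current locale uses. The function name display_width is hypothetical; the caller is assumed to have called setlocale(LC_CTYPE, "") beforehand, otherwise the "C" locale applies.

```c
#define _XOPEN_SOURCE 700   /* needed for the wcwidth declaration on some systems */
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Return the number of display columns occupied by the multibyte string s
 * in the current LC_CTYPE locale, or -1 if s contains an invalid or
 * incomplete sequence or a non-printable character. Nothing here is
 * specific to UTF-8. */
static int display_width(const char *s)
{
    mbstate_t st;
    size_t left;
    int total = 0;

    assert(s != NULL);
    memset(&st, 0, sizeof st);      /* initial conversion state */
    left = strlen(s);
    while (left > 0) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, s, left, &st);
        int w;

        if (n == (size_t)-1 || n == (size_t)-2)
            return -1;              /* invalid or truncated sequence */
        if (n == 0)
            break;                  /* decoded a null wide character */
        w = wcwidth(wc);
        if (w < 0)
            return -1;              /* non-printable character */
        total += w;
        s += n;
        left -= n;
    }
    return total;
}
```

The byte length of each character comes from the mbrtowc return value, so a hand-written length test would save little while still leaving wcwidth to the library.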

I don't mind so much whether we support other multibyte encodings with
more limited ASCII compatibility than UTF-8. It'd be better to have
limited support than an error message followed by setting LC_CTYPE to
C, though.

Oliver

