From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <zsh-workers-return-18493-mason-zsh=primenet.com.au@sunsite.dk>
Received: (qmail 13192 invoked from network); 30 Apr 2003 20:41:16 -0000
Received: from sunsite.dk (130.225.247.90)
  by ns1.primenet.com.au with SMTP; 30 Apr 2003 20:41:16 -0000
Received: (qmail 1579 invoked by alias); 30 Apr 2003 20:41:10 -0000
Mailing-List: contact zsh-workers-help@sunsite.dk; run by ezmlm
Precedence: bulk
X-No-Archive: yes
X-Seq: 18493
Received: (qmail 1572 invoked from network); 30 Apr 2003 20:41:10 -0000
Received: from localhost (HELO sunsite.dk) (127.0.0.1)
  by localhost with SMTP; 30 Apr 2003 20:41:10 -0000
X-MessageWall-Score: 0 (sunsite.dk)
Received: from [209.104.74.2] by sunsite.dk (MessageWall 1.0.8) with SMTP; 30 Apr 2003 20:41:10 -0000
Received: from OZoNE.TZoNE.ORG (vandry@localhost [127.0.0.1])
	by OZoNE.TZoNE.ORG (8.12.3/8.12.3/Debian-5) with ESMTP id h3UKf6Vx014211;
	Wed, 30 Apr 2003 16:41:06 -0400
Received: (from vandry@localhost)
	by OZoNE.TZoNE.ORG (8.12.3/8.12.3/Debian-5) id h3UKf5FI014209;
	Wed, 30 Apr 2003 16:41:05 -0400
From: Phillip Vandry <vandry@TZoNE.ORG>
Date: Wed, 30 Apr 2003 16:41:05 -0400
To: Borzenkov Andrey <Andrey.Borzenkov@siemens.com>
Cc: zsh-workers@sunsite.dk
Subject: Re: Unicode support in Zle
Message-ID: <20030430204105.GA13631@OZoNE.TZoNE.ORG>
References: <20030429194325.GA843@OZoNE.TZoNE.ORG> <6134254DE87BD411908B00A0C99B044F05A0C904@mowd019a.mow.siemens.ru>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <6134254DE87BD411908B00A0C99B044F05A0C904@mowd019a.mow.siemens.ru>
User-Agent: Mutt/1.3.28i

On Wed, Apr 30, 2003 at 09:00:15AM +0400, Borzenkov Andrey wrote:
> 
> BTW get a look at this thread:
> http://www.zsh.org/mla/workers/2002/msg01165.html

I read the thread you pointed to in your subsequent message; it is good
background.

One difference between what's suggested there and what I am doing is that
I chose not to use the libc/locale functions such as wcwidth() and
mblen(). It is debatable whether I should have, but I did this for a
couple of reasons:

- To enable the functionality to work on systems where Unicode is not
handled at all in the system's libc & associated libraries. I still
use lots of older systems that run things like Solaris 2.5.1. These
wouldn't be able to support it if I depended on the libraries. I
will use the locale information from the environment as a hint to
turn on UTF-8 mode, but you can also do it manually (currently by
typing "setopt utf8"). The alternative to using libc functions is
to use glib functions, but I don't really want to add glib to the soup.

- To convince myself that overlong UTF-8 encodings are handled
securely to my satisfaction. Encoding a character in UTF-8
with an overlong encoding can be a security problem (example:
software attempts to purify filenames by stripping slashes and other
special characters but misses [0xc0 0xaf], an overlong encoding of
the slash character in UTF-8).

- Both the function to calculate the length in bytes of a UTF-8 character
and its Unicode value and the function to guess whether a character
occupies a double width cell are easy enough to implement in under
30 lines of code each.
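
A minimal sketch of the kind of decoder described in the list above, not
the code from the actual patch. The name utf8_decode, its signature, and
the choice to reject anything above U+10FFFF and the surrogate range are
all assumptions made for illustration. It returns the byte length and
the Unicode value together, and it refuses overlong forms such as
[0xc0 0xaf] instead of quietly decoding them to a slash.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Decode one UTF-8 sequence from s (at most n bytes available).
 * On success, store the code point in *out and return the number
 * of bytes consumed (1 to 4).  Return 0 for truncated, malformed,
 * or overlong input.  utf8_decode is a hypothetical name.
 */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *out)
{
    /* Smallest code point that legitimately needs each length. */
    static const uint32_t min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    uint32_t cp;
    size_t len, i;

    if (n == 0)
        return 0;
    if (s[0] < 0x80)                { len = 1; cp = s[0]; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; cp = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; cp = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; cp = s[0] & 0x07; }
    else
        return 0;                   /* stray continuation or invalid lead */
    if (len > n)
        return 0;                   /* truncated sequence */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;               /* not a continuation byte */
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    if (cp < min_cp[len])
        return 0;                   /* overlong, e.g. C0 AF for '/' */
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;                   /* outside Unicode or a surrogate */
    *out = cp;
    return len;
}
```

The overlong check is the comparison against min_cp: a two-byte sequence
must encode at least U+0080, so C0 AF (which would decode to 0x2F) is
rejected rather than passed on to code that has already filtered slashes.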

Also as a comment on this thread, I agree that the Meta system will
continue to work unchanged. The characters which need metafication
were chosen so that they would not be likely to occur in normal
text. That's true for ASCII and ISO-8859-x. It's not so true for
UTF-8 so we will probably see more bytes that actually need to
be escaped, but it's not really a problem.
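
To illustrate why UTF-8 text needs more escaping, here is a rough sketch
of Meta-style escaping. It relies on zsh's Meta byte being 0x83 and on
an escaped byte being stored as Meta followed by the byte XOR 32. The
names needs_meta and sketch_metafy are hypothetical, and the range
tested in needs_meta is only an assumed stand-in for the real set of
special bytes, which is defined in the zsh source.

```c
#include <stddef.h>

#define META 0x83   /* zsh's Meta escape byte */

/* Placeholder predicate: NUL and Meta itself must be escaped; the range
 * up to 0x9f is an assumed stand-in for zsh's internal token bytes. */
int needs_meta(unsigned char c)
{
    return c == 0 || (c >= META && c <= 0x9f);
}

/*
 * Copy n bytes from src to dst, escaping each byte that needs it as
 * the pair (META, byte ^ 32).  dst must have room for 2 * n bytes.
 * Returns the number of bytes written.
 */
size_t sketch_metafy(const unsigned char *src, size_t n, unsigned char *dst)
{
    size_t i, o = 0;

    for (i = 0; i < n; i++) {
        if (needs_meta(src[i])) {
            dst[o++] = META;
            dst[o++] = src[i] ^ 32;
        } else {
            dst[o++] = src[i];
        }
    }
    return o;
}
```

UTF-8 continuation bytes fall in 0x80 to 0xbf, which overlaps this
range. For example U+00C3 is encoded as C3 83, and the second byte is
Meta itself, so the escaped form becomes C3 83 A3.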

> Could you please give short description of your work? There are several

The line itself is kept encoded in UTF-8 to maximize compatibility.
Editing functions which work on characters are going to have to be
modified to check for multibyte characters. That's probably going to
sprinkle changes in many places.

The code in zle_refresh.c builds an image of the current lines being
edited for transfer onto the terminal. Because this code counts
character positions a lot to calculate where on the line updates have
to happen, where to move the cursor to, and so on, I decided that
this code needed to continue to have a fixed-length encoding to work
with. Therefore the characters to be placed onto the screen are
kept in an array of 8 bit characters (as before) or 32 bit characters
(basically UCS-4) depending on whether Unicode mode is turned on.
I skip a slot in the array for double width characters.
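
The layout described above could look roughly like the following
sketch. It is not the zle_refresh.c code: CELL_SKIP, is_wide and
fill_cells are hypothetical names, the width test is only a rough guess
based on a few East Asian ranges, and combining (zero width) characters
are ignored entirely.

```c
#include <stddef.h>
#include <stdint.h>

#define CELL_SKIP 0xFFFFFFFFu  /* hypothetical marker for the skipped slot */

/* Rough guess: 1 if cp is usually drawn two cells wide, else 0. */
int is_wide(uint32_t cp)
{
    return (cp >= 0x1100 && cp <= 0x115F) ||   /* Hangul Jamo leads */
           (cp >= 0x2E80 && cp <= 0xA4CF && cp != 0x303F) || /* CJK, kana */
           (cp >= 0xAC00 && cp <= 0xD7A3) ||   /* Hangul syllables */
           (cp >= 0xF900 && cp <= 0xFAFF) ||   /* CJK compat. ideographs */
           (cp >= 0xFE30 && cp <= 0xFE6F) ||   /* CJK compat. forms */
           (cp >= 0xFF00 && cp <= 0xFF60) ||   /* fullwidth forms */
           (cp >= 0xFFE0 && cp <= 0xFFE6) ||
           (cp >= 0x20000 && cp <= 0x3FFFD);
}

/*
 * Lay out ncp code points into a row of ncells fixed-size cells.
 * A wide character takes its own cell plus one CELL_SKIP slot, so a
 * cell index always equals a screen column.  Returns columns used.
 */
size_t fill_cells(const uint32_t *cps, size_t ncp,
                  uint32_t *cells, size_t ncells)
{
    size_t i, col = 0;

    for (i = 0; i < ncp; i++) {
        size_t w = is_wide(cps[i]) ? 2 : 1;
        if (col + w > ncells)
            break;                 /* no room left on this row */
        cells[col++] = cps[i];
        if (w == 2)
            cells[col++] = CELL_SKIP;
    }
    return col;
}
```

Because every array index corresponds to exactly one column, the
existing position arithmetic (where to update, where to put the cursor)
can keep working on indices without knowing about byte lengths.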

> problems associated with mulitbyte locales and switching to UTF does not
> magically solve all of them (strictly speaking it solves none without
> further work).

However it does ease a transition. For code that modifies a string,
the worst that non UTF-8 aware code can do is corrupt the individual
character(s) it plays with, and for code reading or outputting a
string, the worst that non UTF-8 aware code can do is truncate
badly or calculate incorrectly either the length of the string in
characters or the display width.
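
As a small illustration of those failure modes, here is a sketch (with
hypothetical function names, and assuming well-formed input) of the
difference between counting bytes and counting characters, and of
truncating without splitting a character in the middle.

```c
#include <stddef.h>

/* Count UTF-8 characters by counting the bytes that are not
 * continuation bytes (10xxxxxx).  Assumes well-formed input. */
size_t utf8_char_count(const char *s)
{
    size_t n = 0;

    for (; *s; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}

/* Largest cut point <= max_bytes that does not split a character. */
size_t utf8_safe_cut(const char *s, size_t len, size_t max_bytes)
{
    size_t cut = max_bytes < len ? max_bytes : len;

    /* Back up while the byte at the cut is a continuation byte. */
    while (cut > 0 && cut < len && ((unsigned char)s[cut] & 0xC0) == 0x80)
        cut--;
    return cut;
}
```

For the two-character string E6 97 A5 E6 9C AC, byte-oriented code sees
a length of 6 where utf8_char_count sees 2, and a naive cut at 4 bytes
would leave a dangling lead byte where utf8_safe_cut backs up to 3.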

> P.S. Do you have handy any links for i18n in other shells (bash or any
> other) or other text-processing programs? I had them once but lost as it
> seems.

As for bash, I am able to type non-ASCII UTF-8 characters at it, but it
doesn't work well with double width characters; I think it assumes
everything is single width for its display calculations. As such, it
has served me when I wanted to type some Japanese text at the shell,
but I missed zsh.

-Phil