From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (qmail 13192 invoked from network); 30 Apr 2003 20:41:16 -0000
Received: from sunsite.dk (130.225.247.90)
	by ns1.primenet.com.au with SMTP; 30 Apr 2003 20:41:16 -0000
Received: (qmail 1579 invoked by alias); 30 Apr 2003 20:41:10 -0000
Mailing-List: contact zsh-workers-help@sunsite.dk; run by ezmlm
Precedence: bulk
X-No-Archive: yes
X-Seq: 18493
Received: (qmail 1572 invoked from network); 30 Apr 2003 20:41:10 -0000
Received: from localhost (HELO sunsite.dk) (127.0.0.1)
	by localhost with SMTP; 30 Apr 2003 20:41:10 -0000
X-MessageWall-Score: 0 (sunsite.dk)
Received: from [209.104.74.2]
	by sunsite.dk (MessageWall 1.0.8) with SMTP; 30 Apr 2003 20:41:10 -0000
Received: from OZoNE.TZoNE.ORG (vandry@localhost [127.0.0.1])
	by OZoNE.TZoNE.ORG (8.12.3/8.12.3/Debian-5) with ESMTP id h3UKf6Vx014211;
	Wed, 30 Apr 2003 16:41:06 -0400
Received: (from vandry@localhost)
	by OZoNE.TZoNE.ORG (8.12.3/8.12.3/Debian-5) id h3UKf5FI014209;
	Wed, 30 Apr 2003 16:41:05 -0400
From: Phillip Vandry
Date: Wed, 30 Apr 2003 16:41:05 -0400
To: Borzenkov Andrey
Cc: zsh-workers@sunsite.dk
Subject: Re: Unicode support in Zle
Message-ID: <20030430204105.GA13631@OZoNE.TZoNE.ORG>
References: <20030429194325.GA843@OZoNE.TZoNE.ORG>
	<6134254DE87BD411908B00A0C99B044F05A0C904@mowd019a.mow.siemens.ru>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <6134254DE87BD411908B00A0C99B044F05A0C904@mowd019a.mow.siemens.ru>
User-Agent: Mutt/1.3.28i

On Wed, Apr 30, 2003 at 09:00:15AM +0400, Borzenkov Andrey wrote:
>
> BTW get a look at this thread:
> http://www.zsh.org/mla/workers/2002/msg01165.html

I read the thread you pointed to in your subsequent message, good
background.

One difference between what's suggested there and what I am doing is
that I chose not to use the libc/locale functions such as wcwidth() and
mblen().
It is debatable whether I should have, but I did this for a few
reasons:

- To enable the functionality to work on systems where Unicode is not
  handled at all in the system's libc & associated libraries. I still
  use lots of older systems that run things like Solaris 2.5.1. These
  wouldn't be able to support it if I depended on the libraries. I will
  use the locale information from the environment as a hint to turn on
  UTF-8 mode, but you can also do it manually (currently by typing
  "setopt utf8"). The alternative to using libc functions is to use
  glib functions, but I don't really want to add glib to the soup.

- To convince myself that overlong UTF-8 encodings are handled securely
  to my satisfaction. Encoding a character in UTF-8 with an overlong
  encoding can be a security problem (example: software attempts to
  purify filenames by stripping slashes and other special characters
  but misses [0xc0 0xaf], an overlong encoding of the slash character
  in UTF-8).

- Both the function to calculate the length in bytes of a UTF-8
  character and its Unicode value and the function to guess whether a
  character occupies a double width cell are easy enough to implement
  in under 30 lines of code each.

Also as a comment on this thread, I agree that the Meta system will
continue to work unchanged. The characters which need metafication were
chosen so that they would not be likely to occur in normal text. That's
true for ASCII and ISO-8859-x. It's not so true for UTF-8, so we will
probably see more bytes that actually need to be escaped, but it's not
really a problem.

> Could you please give short description of your work? There are several

The line itself is kept encoded in UTF-8 to maximize compatibility.
Editing functions which work on characters are going to have to be
modified to check for multibyte characters. That's probably going to
sprinkle changes in many places.
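
To make the overlong-encoding point above concrete, here is a rough
sketch of a decoder that returns both the byte length and the Unicode
value of one UTF-8 character and refuses overlong forms such as
[0xc0 0xaf]. It is only an illustration, not the code from the patch:
the name utf8_decode and its signature are made up for this example, and
it assumes the four-byte limit (nothing above U+10FFFF, no surrogates)
rather than the older five- and six-byte forms.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Decode one UTF-8 character from s, of which n bytes are available.
 * On success, store the code point in *cp and return the number of
 * bytes consumed (1 to 4). Return 0 for truncated, malformed, or
 * overlong input, for surrogates, and for values above U+10FFFF.
 * utf8_decode is a hypothetical name, not a function from zsh.
 */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    /* Smallest code point that legitimately needs each length. */
    static const uint32_t min_value[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    uint32_t c;
    size_t len, i;

    if (n == 0)
        return 0;

    if (s[0] < 0x80) {                    /* 0xxxxxxx: plain ASCII */
        *cp = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {   /* 110xxxxx: 2 bytes */
        len = 2;
        c = s[0] & 0x1F;
    } else if ((s[0] & 0xF0) == 0xE0) {   /* 1110xxxx: 3 bytes */
        len = 3;
        c = s[0] & 0x0F;
    } else if ((s[0] & 0xF8) == 0xF0) {   /* 11110xxx: 4 bytes */
        len = 4;
        c = s[0] & 0x07;
    } else {
        return 0;       /* stray continuation byte or invalid lead byte */
    }

    if (n < len)
        return 0;       /* sequence is cut short */

    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;   /* not a 10xxxxxx continuation byte */
        c = (c << 6) | (s[i] & 0x3F);
    }

    if (c < min_value[len])
        return 0;       /* overlong: the value fits in a shorter form */
    if (c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;       /* outside Unicode, or a UTF-16 surrogate */

    *cp = c;
    return len;
}
```

With this sketch, the two bytes 0xc0 0xaf are rejected (return value 0)
instead of being decoded as '/', while a plain '/' decodes to 0x2F with
a length of 1. The overlong check is the single comparison against
min_value[], so it costs almost nothing to get right.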
The code in zle_refresh.c builds an image of the current lines being
edited for transfer onto the terminal. Because this code counts
character positions a lot to calculate where on the line updates have
to happen, where to move the cursor to, and so on, I decided that this
code needed to continue to have a fixed-width encoding to work with.
Therefore the characters to be placed onto the screen are kept in an
array of 8 bit characters (as before) or 32 bit characters (basically
UCS-4) depending on whether Unicode mode is turned on. I skip a slot in
the array for double width characters.

> problems associated with multibyte locales and switching to UTF does not
> magically solve all of them (strictly speaking it solves none without
> further work).

However it does ease a transition. For code that modifies a string, the
worst that non UTF-8 aware code can do is corrupt the individual
character(s) it plays with, and for code reading or outputting a
string, the worst that non UTF-8 aware code can do is truncate badly or
miscalculate either the length of the string in characters or its
display width.

> P.S. Do you have handy any links for i18n in other shells (bash or any
> other) or other text-processing programs? I had them once but lost as it
> seems.

As for bash, I am able to type non ASCII UTF-8 characters at it, but it
doesn't work great with double width characters; I think it assumes
everything is single width for its display calculations. As such, it
has served me when I wanted to type some Japanese text at the shell,
but I missed zsh.

-Phil