From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (qmail 20661 invoked from network); 7 May 2003 10:12:59 -0000
Received: from sunsite.dk (130.225.247.90)
  by ns1.primenet.com.au with SMTP; 7 May 2003 10:12:59 -0000
Received: (qmail 17323 invoked by alias); 7 May 2003 10:12:54 -0000
Mailing-List: contact zsh-workers-help@sunsite.dk; run by ezmlm
Precedence: bulk
X-No-Archive: yes
X-Seq: 18509
Received: (qmail 17316 invoked from network); 7 May 2003 10:12:54 -0000
Received: from localhost (HELO sunsite.dk) (127.0.0.1)
  by localhost with SMTP; 7 May 2003 10:12:54 -0000
X-MessageWall-Score: 0 (sunsite.dk)
Received: from [212.125.75.4] by sunsite.dk (MessageWall 1.0.8) with SMTP;
  7 May 2003 10:12:54 -0000
Received: (qmail 8441 invoked from network); 7 May 2003 10:08:06 -0000
Received: from iris.logica.co.uk (158.234.9.163)
  by server-17.tower-1.messagelabs.com with SMTP; 7 May 2003 10:08:06 -0000
Received: from gmcs3.logica.co.uk ([158.234.142.61])
  by iris.logica.co.uk (8.9.3/8.9.3/Debian 8.9.3-21) with ESMTP id LAA29691;
  Wed, 7 May 2003 11:08:06 +0100
X-Authentication-Warning: iris.logica.co.uk: Host [158.234.142.61] claimed to be gmcs3.logica.co.uk
Received: from gmcs3.logica.co.uk (localhost [127.0.0.1])
  by gmcs3.logica.co.uk (8.11.6/8.11.6/SuSE Linux 0.5) with ESMTP id h47A8Hb03344;
  Wed, 7 May 2003 12:08:17 +0200
cc: zsh-workers@sunsite.dk
X-VirusChecked: Checked
In-reply-to: <20030430204105.GA13631@OZoNE.TZoNE.ORG>
From: Oliver Kiddle
References: <20030429194325.GA843@OZoNE.TZoNE.ORG>
  <6134254DE87BD411908B00A0C99B044F05A0C904@mowd019a.mow.siemens.ru>
  <20030430204105.GA13631@OZoNE.TZoNE.ORG>
To: Phillip Vandry
Subject: Re: Unicode support in Zle
Date: Wed, 07 May 2003 12:08:17 +0200
Message-ID: <3342.1052302097@gmcs3.logica.co.uk>

On 30 Apr, Phillip Vandry wrote:
> One difference between what's suggested there and what I am doing is that
> I chose not to use the libc/locale functions such as wcwidth() and
> mblen().
> It is debatable whether I should have, but I did this for a
> couple of reasons:

The main advantage of the libc functions is that they work for other
multi-byte encodings than utf-8. They also do a lot of work for you but
don't let that stop you reproducing it if you want.

> - To enable the functionality to work on systems where Unicode is not
>   handled at all in the system's libc & associated libraries. I still

On the basis that such systems won't have utf-8 handling xterms,
filesystems or anything else, I'm sceptical about the value of that.

>   use lots of older systems that run things like Solaris 2.5.1. These
>   wouldn't be able to support it if I depended on the libraries. I
>   will use the locale information from the environment as a hint to
>   turn on UTF-8 mode, but you can also do it manually (currently by
>   typing "setopt utf8"). The alternative to using libc functions is
>   to use glib functions, but I don't really want to add glib to the soup.

I'd agree that adding glib into the mix would not be what we want.

I'm not sure that a utf8 option achieves anything. Assigning to
LC_CTYPE ought to be sufficient. I'd add a --disable-multibyte option to
configure to cut out support though.

> - To convince myself that the handling of overlong UTF-8 encodings is
>   handled securely to my satisfaction. Encoding a character in UTF-8
>   with an overlong encoding can be a security problem (example:
>   software attempts to purify filenames by stripping slashes and other
>   special characters but misses [0xc0 0xaf], an overlong encoding of
>   the slash character in UTF-8).

Would zsh code actually do the encoding anywhere as opposed to getting
it from the terminal or wherever else? I can't particularly think of an
example where an encoding wouldn't have come from an input somewhere.
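
For illustration only, the overlong case quoted above ([0xc0 0xaf] for
the slash) can be caught on the decoding side by checking the decoded
value against the smallest value that legitimately needs that many
bytes. What follows is a minimal sketch, not code from zsh or from the
patch under discussion: the name utf8_decode_strict is made up, and it
assumes only the 1 to 4 byte forms are accepted, so 5 and 6 byte
sequences are rejected outright.

```c
#include <stddef.h>

/*
 * Hypothetical helper, for illustration only.
 * Decode one UTF-8 sequence from s, with n bytes available.
 * On success, store the code point in *out and return the number of
 * bytes consumed.  Return 0 if the sequence is truncated, malformed,
 * overlong, or outside the assumed 1..4 byte range.
 */
static size_t
utf8_decode_strict(const unsigned char *s, size_t n, unsigned long *out)
{
    /* min_val[len]: smallest code point that needs len bytes */
    static const unsigned long min_val[] = { 0, 0, 0x80, 0x800, 0x10000 };
    unsigned long val;
    size_t len, i;

    if (n == 0)
        return 0;

    if (s[0] < 0x80) {
        len = 1;
        val = s[0];
    } else if ((s[0] & 0xe0) == 0xc0) {
        len = 2;
        val = s[0] & 0x1f;
    } else if ((s[0] & 0xf0) == 0xe0) {
        len = 3;
        val = s[0] & 0x0f;
    } else if ((s[0] & 0xf8) == 0xf0) {
        len = 4;
        val = s[0] & 0x07;
    } else {
        return 0;       /* stray continuation byte or 5/6-byte lead */
    }

    if (n < len)
        return 0;       /* truncated sequence */

    for (i = 1; i < len; i++) {
        if ((s[i] & 0xc0) != 0x80)
            return 0;   /* not a continuation byte */
        val = (val << 6) | (unsigned long)(s[i] & 0x3f);
    }

    if (val < min_val[len])
        return 0;       /* overlong: fits in fewer bytes */
    if (val > 0x10ffff)
        return 0;       /* beyond the assumed code point range */

    *out = val;
    return len;
}
```

With this check, the two bytes 0xc0 0xaf are rejected, while a plain
0x2f still decodes to the slash in one byte.
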
> - Both the function to calculate the length in bytes of a UTF-8 character
>   and its Unicode value and the function to guess whether a character
>   occupies a double width cell are easy enough to implement in under
>   30 lines of code each.

Do these functions map fairly closely onto the libc equivalents? We could
perhaps apply them on systems where configure doesn't find functions
like wctomb in libc? So systems with wctomb and friends would get a
little less bloat and support for other multi-byte encodings.

Besides these comments, this all sounds very good. I look forward to
hearing about further progress.

Oliver

PS. Just in case it is any use to you, I've attached UCS4 to UTF-8
conversion code which I meant to put into the \u/\U code as a fallback
for systems like Solaris 8. I had to do a bit of searching to find
examples of this that were not GPL'd.

# if defined(HAVE_NL_LANGINFO) && defined(CODESET)
    if (!strcmp(nl_langinfo(CODESET), "UTF-8")) {
        int len;

        /* number of bytes needed to encode wval */
        if (wval < 0x80)
            len = 1;
        else if (wval < 0x800)
            len = 2;
        else if (wval < 0x10000)
            len = 3;
        else if (wval < 0x200000)
            len = 4;
        else if (wval < 0x4000000)
            len = 5;
        else
            len = 6;

        switch (len) { /* falls through except to the last case */
        case 6: t[5] = (wval & 0x3f) | 0x80; wval >>= 6;
        case 5: t[4] = (wval & 0x3f) | 0x80; wval >>= 6;
        case 4: t[3] = (wval & 0x3f) | 0x80; wval >>= 6;
        case 3: t[2] = (wval & 0x3f) | 0x80; wval >>= 6;
        case 2: t[1] = (wval & 0x3f) | 0x80; wval >>= 6;
            /* lead byte: remaining bits plus the length marker */
            *t = wval | ((0xfc << (6 - len)) & 0xfc);
            break;
        case 1: *t = wval;
        }
        t += len;
        continue;
    }
# endif
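
The fragment above relies on t, wval and an enclosing loop that are not
shown. For trying the conversion in isolation, here is a minimal sketch
of the same logic as a standalone function. The name ucs4_to_utf8 and
the caller-supplied 6-byte buffer are assumptions for illustration, and
a loop is swapped in for the fall-through switch; the length thresholds
and the lead byte mask are unchanged.

```c
#include <stddef.h>

/*
 * Hypothetical standalone wrapper around the conversion logic above.
 * Encode the UCS-4 value wval as UTF-8 into buf, which must have room
 * for 6 bytes, and return the number of bytes written.
 */
static size_t
ucs4_to_utf8(unsigned long wval, unsigned char *buf)
{
    int len, i;

    /* number of bytes needed, same thresholds as the fragment */
    if (wval < 0x80)
        len = 1;
    else if (wval < 0x800)
        len = 2;
    else if (wval < 0x10000)
        len = 3;
    else if (wval < 0x200000)
        len = 4;
    else if (wval < 0x4000000)
        len = 5;
    else
        len = 6;

    if (len == 1) {
        buf[0] = (unsigned char)wval;
        return 1;
    }

    /* continuation bytes, last one first, six bits each */
    for (i = len - 1; i > 0; i--) {
        buf[i] = (unsigned char)((wval & 0x3f) | 0x80);
        wval >>= 6;
    }

    /* lead byte: remaining bits plus the length marker */
    buf[0] = (unsigned char)(wval | ((0xfcUL << (6 - len)) & 0xfc));
    return (size_t)len;
}
```

For example, 0x20ac (the euro sign) comes out as the three bytes
0xe2 0x82 0xac.
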