From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <zsh-workers-return-18509-mason-zsh=primenet.com.au@sunsite.dk>
Received: (qmail 20661 invoked from network); 7 May 2003 10:12:59 -0000
Received: from sunsite.dk (130.225.247.90)
  by ns1.primenet.com.au with SMTP; 7 May 2003 10:12:59 -0000
Received: (qmail 17323 invoked by alias); 7 May 2003 10:12:54 -0000
Mailing-List: contact zsh-workers-help@sunsite.dk; run by ezmlm
Precedence: bulk
X-No-Archive: yes
X-Seq: 18509
Received: (qmail 17316 invoked from network); 7 May 2003 10:12:54 -0000
Received: from localhost (HELO sunsite.dk) (127.0.0.1)
  by localhost with SMTP; 7 May 2003 10:12:54 -0000
X-MessageWall-Score: 0 (sunsite.dk)
Received: from [212.125.75.4] by sunsite.dk (MessageWall 1.0.8) with SMTP; 7 May 2003 10:12:54 -0000
Received: (qmail 8441 invoked from network); 7 May 2003 10:08:06 -0000
Received: from iris.logica.co.uk (158.234.9.163)
  by server-17.tower-1.messagelabs.com with SMTP; 7 May 2003 10:08:06 -0000
Received: from gmcs3.logica.co.uk ([158.234.142.61])
	by iris.logica.co.uk (8.9.3/8.9.3/Debian 8.9.3-21) with ESMTP id LAA29691;
	Wed, 7 May 2003 11:08:06 +0100
X-Authentication-Warning: iris.logica.co.uk: Host [158.234.142.61] claimed to be gmcs3.logica.co.uk
Received: from gmcs3.logica.co.uk (localhost [127.0.0.1])
	by gmcs3.logica.co.uk (8.11.6/8.11.6/SuSE Linux 0.5) with ESMTP id h47A8Hb03344;
	Wed, 7 May 2003 12:08:17 +0200
cc: zsh-workers@sunsite.dk
X-VirusChecked: Checked
In-reply-to: <20030430204105.GA13631@OZoNE.TZoNE.ORG> 
From: Oliver Kiddle <okiddle@yahoo.co.uk>
References: <20030429194325.GA843@OZoNE.TZoNE.ORG> <6134254DE87BD411908B00A0C99B044F05A0C904@mowd019a.mow.siemens.ru> <20030430204105.GA13631@OZoNE.TZoNE.ORG>
To: Phillip Vandry <vandry@TZoNE.ORG>
Subject: Re: Unicode support in Zle 
Date: Wed, 07 May 2003 12:08:17 +0200
Message-ID: <3342.1052302097@gmcs3.logica.co.uk>

On 30 Apr, Phillip Vandry wrote:

> One difference between what's suggested there and what I am doing is that
> I chose not to use the libc/locale functions such as wcwidth() and
> mblen(). It is debatable whether I should have, but I did this for a
> couple of reasons:

The main advantage of the libc functions is that they also work for
multi-byte encodings other than UTF-8. They also do a lot of the work
for you, but don't let that stop you reproducing it if you want.

> - To enable the functionality to work on systems where Unicode is not
> handled at all in the system's libc & associated libraries. I still

On the basis that such systems won't have utf-8 handling xterms,
filesystems or anything else, I'm sceptical about the value of that.

> use lots of older systems that run things like Solaris 2.5.1. These
> wouldn't be able to support it if I depended on the libraries. I
> will use the locale information from the environment as a hint to
> turn on UTF-8 mode, but you can also do it manually (currently by
> typing "setopt utf8"). The alternative to using libc functions is
> to use glib functions, but I don't really want to add glib to the soup.

I'd agree that adding glib into the mix would not be what we want. I'm
not sure that a utf8 option achieves anything. Assigning to LC_CTYPE
ought to be sufficient.
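
A minimal sketch of keying UTF-8 mode off LC_CTYPE alone, assuming
setlocale() and nl_langinfo(CODESET) are available. The helper names
codeset_is_utf8 and locale_wants_utf8 are made up for illustration and
are not existing zsh functions:

```c
#include <locale.h>
#include <langinfo.h>
#include <string.h>

/* Hypothetical helper: does a codeset name denote UTF-8?
 * Only the common spellings are accepted (an assumption; a libc may
 * report others). */
static int codeset_is_utf8(const char *cs)
{
    return cs != NULL &&
        (!strcmp(cs, "UTF-8") || !strcmp(cs, "utf-8") ||
         !strcmp(cs, "UTF8")  || !strcmp(cs, "utf8"));
}

/* Hypothetical helper: adopt LC_CTYPE from the environment and report
 * whether the resulting codeset is UTF-8. */
static int locale_wants_utf8(void)
{
    if (setlocale(LC_CTYPE, "") == NULL)
        return 0;   /* locale not supported by this libc: no UTF-8 mode */
    return codeset_is_utf8(nl_langinfo(CODESET));
}
```

With something like this, assigning LC_CTYPE and re-running the check
would do the job of a separate option.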

I'd add a --disable-multibyte option to configure to cut out support
though.

> - To convince myself that the handling of overlong UTF-8 encodings is
> handled securely to my satisfaction. Encoding a character in UTF-8
> with an overlong encoding can be a security problem (example:
> software attempts to purify filenames by stripping slashes and other
> special characters but misses [0xc0 0xaf], an overlong encoding of
> the slash character in UTF-8).

Would zsh code actually do the encoding anywhere as opposed to getting
it from the terminal or wherever else? I can't particularly think of an
example where an encoding wouldn't have come from an input somewhere.
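
For reference, the check under discussion belongs on the decoding side.
The following is only a sketch of a decoder that refuses overlong forms
such as [0xc0 0xaf]; the name utf8_decode and its interface are
hypothetical, and it assumes sequences of at most four bytes rather
than the older five- and six-byte forms:

```c
#include <stddef.h>

/* Hypothetical decoder: returns the number of bytes consumed (1..4) and
 * stores the code point in *out, or returns -1 for a malformed,
 * truncated or overlong sequence. */
static int utf8_decode(const unsigned char *s, size_t n, unsigned long *out)
{
    /* smallest code point that legitimately needs a given length */
    static const unsigned long min_for_len[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    unsigned long c;
    int len, i;

    if (n == 0)
        return -1;
    if (s[0] < 0x80) {
        *out = s[0];
        return 1;
    } else if ((s[0] & 0xe0) == 0xc0) {
        len = 2; c = s[0] & 0x1f;
    } else if ((s[0] & 0xf0) == 0xe0) {
        len = 3; c = s[0] & 0x0f;
    } else if ((s[0] & 0xf8) == 0xf0) {
        len = 4; c = s[0] & 0x07;
    } else {
        return -1;              /* stray continuation or longer lead byte */
    }

    if (n < (size_t)len)
        return -1;              /* truncated sequence */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xc0) != 0x80)
            return -1;          /* not a continuation byte */
        c = (c << 6) | (s[i] & 0x3f);
    }
    if (c < min_for_len[len])
        return -1;              /* overlong, e.g. 0xc0 0xaf for '/' */
    if (c > 0x10ffff || (c >= 0xd800 && c <= 0xdfff))
        return -1;              /* out of range or surrogate */
    *out = c;
    return len;
}
```

Given the bytes 0xc0 0xaf this returns -1, whereas a plain 0x2f decodes
to the slash as usual.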

> - Both the function to calculate the length in bytes of a UTF-8 character
> and its Unicode value and the function to guess whether a character
> occupies a double width cell are easy enough to implement in under
> 30 lines of code each.

Do these functions map fairly closely onto the libc equivalents? If so,
we could perhaps apply them only on systems where configure doesn't find
functions like wctomb in libc. Systems with wctomb and friends would
then get a little less bloat and support for other multi-byte encodings.
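
A sketch of how such a fallback might be wired up, assuming configure
defines a HAVE_WCWIDTH macro (an assumption, as are the names zwcwidth
and fallback_wcwidth). The ranges are only a rough guess at the
double-width characters, and combining characters are not treated as
zero width:

```c
#include <wchar.h>

/* Hypothetical fallback with the same return convention as wcwidth():
 * -1 for control characters, 0 for NUL, otherwise 1 or 2 cells. */
static int fallback_wcwidth(unsigned long c)
{
    if (c == 0)
        return 0;
    if (c < 0x20 || (c >= 0x7f && c < 0xa0))
        return -1;                          /* control characters */
    if ((c >= 0x1100 && c <= 0x115f) ||     /* Hangul Jamo initials */
        (c >= 0x2e80 && c <= 0xa4cf) ||     /* CJK radicals through Yi (approximate) */
        (c >= 0xac00 && c <= 0xd7a3) ||     /* Hangul syllables */
        (c >= 0xf900 && c <= 0xfaff) ||     /* CJK compatibility ideographs */
        (c >= 0xfe30 && c <= 0xfe6f) ||     /* CJK compatibility forms */
        (c >= 0xff00 && c <= 0xff60) ||     /* fullwidth forms */
        (c >= 0xffe0 && c <= 0xffe6))       /* fullwidth signs */
        return 2;
    return 1;
}

/* Use libc where configure found it, the fallback elsewhere. */
#ifdef HAVE_WCWIDTH
# define zwcwidth(c) wcwidth((wchar_t)(c))
#else
# define zwcwidth(c) fallback_wcwidth(c)
#endif
```

Callers would use zwcwidth() throughout, so only systems without
wcwidth() carry the extra code.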

Besides these comments, this all sounds very good. I look forward to
hearing about further progress.

Oliver

PS. Just in case it is any use to you, I've attached UCS4 to UTF-8
conversion code which I meant to put into the \u/\U code as a fallback
for systems like Solaris 8. I had to do a bit of searching to find
examples of this that were not GPL'd.

#  if defined(HAVE_NL_LANGINFO) && defined(CODESET)
		if (!strcmp(nl_langinfo(CODESET), "UTF-8")) {
		    int len;

		    /* number of bytes needed to encode wval */
		    if (wval < 0x80)
			len = 1;
		    else if (wval < 0x800)
			len = 2;
		    else if (wval < 0x10000)
			len = 3;
		    else if (wval < 0x200000)
			len = 4;
		    else if (wval < 0x4000000)
			len = 5;
		    else
			len = 6;

		    /* fill continuation bytes from the end, 6 bits at a time */
		    switch (len) { /* falls through except to the last case */
		    case 6: t[5] = (wval & 0x3f) | 0x80; wval >>= 6;
		    case 5: t[4] = (wval & 0x3f) | 0x80; wval >>= 6;
		    case 4: t[3] = (wval & 0x3f) | 0x80; wval >>= 6;
		    case 3: t[2] = (wval & 0x3f) | 0x80; wval >>= 6;
		    case 2: t[1] = (wval & 0x3f) | 0x80; wval >>= 6;
			/* lead byte: length marker in the high bits, rest of wval below */
			*t = wval | ((0xfc << (6 - len)) & 0xfc);
			break;
		    case 1: *t = wval;
		    }
		    t += len;
		    continue;
		}
#  endif
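
The fragment above relies on t and wval from the surrounding \u/\U
code. For trying it in isolation, a standalone sketch of the same
logic could look like this; the function name ucs4_to_utf8 is
hypothetical, and it assumes wval is below 0x80000000 and that t has
room for six bytes:

```c
#include <stddef.h>

/* Hypothetical standalone form: writes the UTF-8 encoding of wval to t
 * and returns the number of bytes written (1..6). */
static size_t ucs4_to_utf8(unsigned long wval, unsigned char *t)
{
    size_t len, i;

    if (wval < 0x80) {
        t[0] = (unsigned char)wval;
        return 1;
    } else if (wval < 0x800)
        len = 2;
    else if (wval < 0x10000)
        len = 3;
    else if (wval < 0x200000)
        len = 4;
    else if (wval < 0x4000000)
        len = 5;
    else
        len = 6;

    /* continuation bytes, last one first, 6 bits each */
    for (i = len - 1; i > 0; i--) {
        t[i] = (unsigned char)((wval & 0x3f) | 0x80);
        wval >>= 6;
    }
    /* lead byte: length marker in the high bits, rest of wval below */
    t[0] = (unsigned char)(wval | ((0xfc << (6 - len)) & 0xfc));
    return len;
}
```

For example, 0x20ac (the euro sign) comes out as the three bytes
0xe2 0x82 0xac.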