Re: Unicode support in Zle

From: Oliver Kiddle <okiddle@yahoo.co.uk>
To: Phillip Vandry <vandry@TZoNE.ORG>
Cc: zsh-workers@sunsite.dk
Subject: Re: Unicode support in Zle
Date: Wed, 07 May 2003 12:08:17 +0200
Message-ID: <3342.1052302097@gmcs3.logica.co.uk>
In-Reply-To: <20030430204105.GA13631@OZoNE.TZoNE.ORG>

On 30 Apr, Phillip Vandry wrote:

> One difference between what's suggested there and what I am doing is that
> I chose not to use the libc/locale functions such as wcwidth() and
> mblen(). It is debatable whether I should have, but I did this for a
> couple of reasons:

The main advantage of the libc functions is that they work for
multibyte encodings other than UTF-8. They also do a lot of the work
for you, but don't let that stop you from reimplementing it if you want to.
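
To illustrate that point, here is a minimal sketch (not zsh code; the
name mb_char_count is made up for the example) that counts characters
with mbrlen(). It assumes the caller has already selected a locale with
setlocale(); the same loop then handles UTF-8, EUC-JP or any other
encoding the locale's LC_CTYPE names.

```c
#include <string.h>
#include <wchar.h>

/* Count characters (not bytes) in s under the current LC_CTYPE.
 * Returns (size_t)-1 if s holds an invalid or truncated sequence.
 * Hypothetical helper for illustration only. */
size_t mb_char_count(const char *s)
{
    mbstate_t st;
    size_t remaining = strlen(s);
    size_t count = 0;

    memset(&st, 0, sizeof st);          /* start in the initial shift state */
    while (remaining > 0) {
        size_t n = mbrlen(s, remaining, &st);

        if (n == (size_t)-1 || n == (size_t)-2)
            return (size_t)-1;          /* invalid or incomplete character */
        if (n == 0)
            break;                      /* embedded NUL; cannot occur here */
        s += n;
        remaining -= n;
        count++;
    }
    return count;
}
```

Nothing in the loop mentions a particular encoding, which is the
portability the libc route buys.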

> - To enable the functionality to work on systems where Unicode is not
> handled at all in the system's libc & associated libraries. I still

On the basis that such systems won't have UTF-8-capable xterms,
filesystems or anything else, I'm sceptical about the value of that.

> use lots of older systems that run things like Solaris 2.5.1. These
> wouldn't be able to support it if I depended on the libraries. I
> will use the locale information from the environment as a hint to
> turn on UTF-8 mode, but you can also do it manually (currently by
> typing "setopt utf8"). The alternative to using libc functions is
> to use glib functions, but I don't really want to add glib to the soup.

I'd agree that adding glib into the mix would not be what we want. I'm
not sure that a utf8 option achieves anything. Assigning to LC_CTYPE
ought to be sufficient.
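
A minimal sketch of what relying on LC_CTYPE could look like, assuming
a system with nl_langinfo() and CODESET (the function name
locale_is_utf8 is hypothetical, not something in zsh):

```c
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/* Select locale_name for LC_CTYPE and report whether its codeset is
 * UTF-8. Pass "" to take the locale from the environment (LC_ALL,
 * LC_CTYPE, LANG). Returns 0 if the locale is unavailable.
 * Hypothetical helper for illustration only. */
int locale_is_utf8(const char *locale_name)
{
    if (!setlocale(LC_CTYPE, locale_name))
        return 0;
    return strcmp(nl_langinfo(CODESET), "UTF-8") == 0;
}
```

Calling locale_is_utf8("") at startup, and again after LC_CTYPE is
assigned, would give the same switch that a separate option would.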

I would, though, add a --disable-multibyte option to configure so that
the support can be compiled out.

> - To convince myself that overlong UTF-8 encodings are handled
> securely to my satisfaction. Encoding a character in UTF-8
> with an overlong encoding can be a security problem (example:
> software attempts to purify filenames by stripping slashes and other
> special characters but misses [0xc0 0xaf], an overlong encoding of
> the slash character in UTF-8).

Would zsh code actually do the encoding anywhere as opposed to getting
it from the terminal or wherever else? I can't particularly think of an
example where an encoding wouldn't have come from an input somewhere.
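
For reference, the check the quoted text is concerned with sits on the
decoding side. A minimal sketch of a decoder that refuses overlong
forms such as [0xc0 0xaf] follows. The name utf8_decode_strict is made
up, and the sketch assumes the later four-byte limit of RFC 3629
(nothing above U+10FFFF, no surrogates) rather than the six-byte forms
used elsewhere in this message.

```c
#include <stddef.h>

/* Decode one UTF-8 sequence from s (at most n bytes) into *cp.
 * Returns the number of bytes consumed, or 0 if the sequence is
 * invalid, truncated or overlong. Hypothetical helper. */
size_t utf8_decode_strict(const unsigned char *s, size_t n, unsigned long *cp)
{
    /* smallest code point that legitimately needs a sequence of length i */
    static const unsigned long min_for_len[] = { 0, 0, 0x80, 0x800, 0x10000 };
    unsigned long v;
    size_t len, i;

    if (n == 0)
        return 0;
    if (s[0] < 0x80) {
        *cp = s[0];
        return 1;
    } else if ((s[0] & 0xe0) == 0xc0) {
        len = 2; v = s[0] & 0x1f;
    } else if ((s[0] & 0xf0) == 0xe0) {
        len = 3; v = s[0] & 0x0f;
    } else if ((s[0] & 0xf8) == 0xf0) {
        len = 4; v = s[0] & 0x07;
    } else {
        return 0;               /* stray continuation or 5/6-byte lead byte */
    }
    if (n < len)
        return 0;               /* truncated */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xc0) != 0x80)
            return 0;           /* not a continuation byte */
        v = (v << 6) | (s[i] & 0x3f);
    }
    if (v < min_for_len[len])
        return 0;               /* overlong, e.g. 0xc0 0xaf for '/' */
    if (v > 0x10ffff || (v >= 0xd800 && v <= 0xdfff))
        return 0;               /* out of range or surrogate */
    *cp = v;
    return len;
}
```

With this, the two bytes 0xc0 0xaf are rejected while a plain "/" byte
decodes to 0x2f as usual.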

> - Both the function to calculate the length in bytes of a UTF-8 character
> and its Unicode value and the function to guess whether a character
> occupies a double width cell are easy enough to implement in under
> 30 lines of code each.

Do these functions map fairly closely onto the libc equivalents? We
could perhaps use them on systems where configure doesn't find
functions like wctomb in libc. Systems with wctomb and friends would
then get a little less bloat, plus support for other multi-byte encodings.
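
A rough sketch of that arrangement for the width function, assuming
configure defines a HAVE_WCWIDTH macro (hypothetical here, as is the
name char_cells). The fallback range list is an approximation in the
style of commonly circulated wcwidth replacements; it ignores combining
characters and should be checked against current East Asian Width data.

```c
#ifdef HAVE_WCWIDTH
# include <wchar.h>
#else
/* Fallback guess: does this code point occupy two cells?
 * Approximate ranges only (assumption, see lead-in). */
static int fallback_is_wide(unsigned long ucs)
{
    return ucs >= 0x1100 &&
        (ucs <= 0x115f ||                       /* Hangul Jamo initials */
         (ucs >= 0x2e80 && ucs <= 0xa4cf && ucs != 0x303f) || /* CJK .. Yi */
         (ucs >= 0xac00 && ucs <= 0xd7a3) ||    /* Hangul syllables */
         (ucs >= 0xf900 && ucs <= 0xfaff) ||    /* CJK compatibility ideographs */
         (ucs >= 0xfe30 && ucs <= 0xfe6f) ||    /* CJK compatibility forms */
         (ucs >= 0xff00 && ucs <= 0xff60) ||    /* fullwidth forms */
         (ucs >= 0xffe0 && ucs <= 0xffe6) ||
         (ucs >= 0x20000 && ucs <= 0x3fffd));   /* supplementary ideographs */
}
#endif

/* Number of terminal cells for ucs: -1 for control characters,
 * 0 for NUL, otherwise 1 or 2. Uses libc when configure found it. */
int char_cells(unsigned long ucs)
{
#ifdef HAVE_WCWIDTH
    return wcwidth((wchar_t)ucs);
#else
    if (ucs == 0)
        return 0;
    if (ucs < 0x20 || (ucs >= 0x7f && ucs < 0xa0))
        return -1;
    return fallback_is_wide(ucs) ? 2 : 1;
#endif
}
```

Callers only ever see char_cells(), so the choice between libc and the
built-in table stays a configure-time detail.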

Besides these comments, this all sounds very good. I look forward to
hearing about further progress.

Oliver

PS. Just in case it is any use to you, I've attached UCS4 to UTF-8
conversion code which I meant to put into the \u/\U code as a fallback
for systems like Solaris 8. I had to do a bit of searching to find
examples of this that were not GPL'd.

#  if defined(HAVE_NL_LANGINFO) && defined(CODESET)
		if (!strcmp(nl_langinfo(CODESET), "UTF-8")) {
		    int len;

		    /* number of bytes needed to encode wval */
		    if (wval < 0x80)
			len = 1;
		    else if (wval < 0x800)
			len = 2;
		    else if (wval < 0x10000)
			len = 3;
		    else if (wval < 0x200000)
			len = 4;
		    else if (wval < 0x4000000)
			len = 5;
		    else
			len = 6;

		    /* continuation bytes are written last to first */
		    switch (len) { /* falls through except to the last case */
		    case 6: t[5] = (wval & 0x3f) | 0x80; wval >>= 6;
		    case 5: t[4] = (wval & 0x3f) | 0x80; wval >>= 6;
		    case 4: t[3] = (wval & 0x3f) | 0x80; wval >>= 6;
		    case 3: t[2] = (wval & 0x3f) | 0x80; wval >>= 6;
		    case 2: t[1] = (wval & 0x3f) | 0x80; wval >>= 6;
			/* lead byte: top len bits set, then the remaining value bits */
			*t = wval | ((0xfc << (6 - len)) & 0xfc);
			break;
		    case 1: *t = wval;
		    }
		    t += len;
		    continue;
		}
#  endif
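
The fragment above depends on its surrounding loop (wval, t and the
continue). As a sketch of how the same logic might look as a standalone
function, assuming a caller-supplied buffer of at least 6 bytes (the
name ucs4_to_utf8 is made up for the example):

```c
#include <stddef.h>

/* Encode wval as UTF-8 into t, which must have room for 6 bytes.
 * Returns the number of bytes written. Same scheme as the fragment
 * above, including the old 5- and 6-byte forms. Hypothetical helper. */
size_t ucs4_to_utf8(unsigned long wval, unsigned char *t)
{
    int len;

    if (wval < 0x80)
        len = 1;
    else if (wval < 0x800)
        len = 2;
    else if (wval < 0x10000)
        len = 3;
    else if (wval < 0x200000)
        len = 4;
    else if (wval < 0x4000000)
        len = 5;
    else
        len = 6;

    switch (len) {
    case 6: t[5] = (unsigned char)((wval & 0x3f) | 0x80); wval >>= 6; /* fall through */
    case 5: t[4] = (unsigned char)((wval & 0x3f) | 0x80); wval >>= 6; /* fall through */
    case 4: t[3] = (unsigned char)((wval & 0x3f) | 0x80); wval >>= 6; /* fall through */
    case 3: t[2] = (unsigned char)((wval & 0x3f) | 0x80); wval >>= 6; /* fall through */
    case 2: t[1] = (unsigned char)((wval & 0x3f) | 0x80); wval >>= 6;
        /* lead byte: top len bits set, then the remaining value bits */
        t[0] = (unsigned char)(wval | ((0xfc << (6 - len)) & 0xfc));
        break;
    case 1:
        t[0] = (unsigned char)wval;
    }
    return (size_t)len;
}
```

For example, ucs4_to_utf8(0x20ac, buf) writes the three bytes
0xe2 0x82 0xac and returns 3.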
