Unicode support in Zle

zsh-workers

* Unicode support in Zle
@ 2003-04-29 19:43 Phillip Vandry
  2003-04-30  5:00 ` Borzenkov Andrey
  0 siblings, 1 reply; 7+ messages in thread
From: Phillip Vandry @ 2003-04-29 19:43 UTC (permalink / raw)
  To: zsh-workers

Zsh workers,

I would like to find out whether anyone is working on support for
typing and editing Unicode characters in the ZLE (using UTF-8).
I looked at the most recent development versions I could see and
I didn't notice anything.

Because I wanted the functionality I have already delved into the code
to determine the feasibility of doing it without breaking anything.
I will probably work what I've done already into a patch if indeed
nobody is already doing this. I expect a first cut of it will be an
approximately 1200 line patch, mostly to zle_main.c and zle_refresh.c.

Thanks for your input.

-Phil


* RE: Unicode support in Zle
  2003-04-29 19:43 Unicode support in Zle Phillip Vandry
@ 2003-04-30  5:00 ` Borzenkov Andrey
  2003-04-30  6:14   ` Borzenkov Andrey
  2003-04-30 20:41   ` Phillip Vandry
  0 siblings, 2 replies; 7+ messages in thread
From: Borzenkov Andrey @ 2003-04-30  5:00 UTC (permalink / raw)
  To: 'Phillip Vandry', zsh-workers

> 
> I would like to find out whether anyone is working on support for
> typing and editing Unicode characters in the ZLE (using UTF-8).
> I looked at the most recent development versions I could see and
> I didn't notice anything.
> 
> Because I wanted the functionality I have already delved into the code
> to determine the feasibility of doing it without breaking anything.
> I will probably work what I've done already into a patch if indeed
> nobody is already doing this. I expect a first cut of it will be an
> approximately 1200 line patch, mostly to zle_main.c and zle_refresh.c.
> 

I would be happy to join you. Mandrake 9.1 defaulted to UTF-8 on update and
this was immediately visible with Zsh :)

Could you please give a short description of your work? There are several
problems associated with multibyte locales, and switching to UTF does not
magically solve all of them (strictly speaking, it solves none without
further work).

Thank you very much for your effort.

-andrey

P.S. Do you have any links handy for i18n in other shells (bash or any
other) or other text-processing programs? I had some once but seem to
have lost them.


* RE: Unicode support in Zle
  2003-04-30  5:00 ` Borzenkov Andrey
@ 2003-04-30  6:14   ` Borzenkov Andrey
  2003-04-30 20:41   ` Phillip Vandry
  1 sibling, 0 replies; 7+ messages in thread
From: Borzenkov Andrey @ 2003-04-30  6:14 UTC (permalink / raw)
  To: 'Phillip Vandry', zsh-workers


> 
> P.S. Do you have any links handy for i18n in other shells (bash or any
> other) or other text-processing programs? I had some once but seem to
> have lost them.

BTW, take a look at this thread:
http://www.zsh.org/mla/workers/2002/msg01165.html



* Re: Unicode support in Zle
  2003-04-30  5:00 ` Borzenkov Andrey
  2003-04-30  6:14   ` Borzenkov Andrey
@ 2003-04-30 20:41   ` Phillip Vandry
  2003-05-07 10:08     ` Oliver Kiddle
  1 sibling, 1 reply; 7+ messages in thread
From: Phillip Vandry @ 2003-04-30 20:41 UTC (permalink / raw)
  To: Borzenkov Andrey; +Cc: zsh-workers

On Wed, Apr 30, 2003 at 09:00:15AM +0400, Borzenkov Andrey wrote:
> 
> BTW, take a look at this thread:
> http://www.zsh.org/mla/workers/2002/msg01165.html

I read the thread you pointed to in your subsequent message; it is good
background.

One difference between what's suggested there and what I am doing is that
I chose not to use the libc/locale functions such as wcwidth() and
mblen(). It is debatable whether I should have, but I did this for a
couple of reasons:

- To enable the functionality to work on systems where Unicode is not
handled at all in the system's libc & associated libraries. I still
use lots of older systems that run things like Solaris 2.5.1. These
wouldn't be able to support it if I depended on the libraries. I
will use the locale information from the environment as a hint to
turn on UTF-8 mode, but you can also do it manually (currently by
typing "setopt utf8"). The alternative to using libc functions is
to use glib functions, but I don't really want to add glib to the soup.

- To convince myself that overlong UTF-8 encodings are handled
securely to my satisfaction. Encoding a character in UTF-8
with an overlong encoding can be a security problem (example:
software attempts to purify filenames by stripping slashes and other
special characters but misses [0xc0 0xaf], an overlong encoding of
the slash character in UTF-8).

- Both the function to calculate the length in bytes of a UTF-8 character
and its Unicode value and the function to guess whether a character
occupies a double width cell are easy enough to implement in under
30 lines of code each.
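
A minimal sketch of the kind of decoder described in the points above,
not the code from the patch. The name utf8_decode and its signature are
hypothetical. It assumes sequences of at most four bytes and does not
check for surrogates or values above U+10FFFF; what it does show is the
overlong check, which rejects [0xc0 0xaf] instead of decoding it to a
slash.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Decode one UTF-8 sequence from s, where n bytes are available.
 * Returns the sequence length in bytes and stores the code point in *out,
 * or returns 0 for truncated, malformed, or overlong input.
 * utf8_decode is a hypothetical name, not a zsh function.
 */
static size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *out)
{
    /* Smallest code point that legitimately needs a sequence of each length. */
    static const uint32_t min_value[] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t len, i;
    uint32_t cp;

    assert(s != NULL && out != NULL);
    if (n == 0)
        return 0;

    if (s[0] < 0x80) {
        *out = s[0];
        return 1;
    } else if ((s[0] & 0xe0) == 0xc0) {
        len = 2; cp = s[0] & 0x1f;
    } else if ((s[0] & 0xf0) == 0xe0) {
        len = 3; cp = s[0] & 0x0f;
    } else if ((s[0] & 0xf8) == 0xf0) {
        len = 4; cp = s[0] & 0x07;
    } else {
        return 0;               /* stray continuation or unsupported lead byte */
    }

    if (n < len)
        return 0;               /* truncated sequence */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xc0) != 0x80)
            return 0;           /* not a continuation byte */
        cp = (cp << 6) | (s[i] & 0x3f);
    }
    if (cp < min_value[len])
        return 0;               /* overlong encoding */

    *out = cp;
    return len;
}
```

With this check, a filter that strips slashes after decoding never sees
a second spelling of the slash character, because the overlong form is
refused at the decoding step.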

Also as a comment on this thread, I agree that the Meta system will
continue to work unchanged. The characters which need metafication
were chosen so that they would not be likely to occur in normal
text. That's true for ASCII and ISO-8859-x. It's not so true for
UTF-8 so we will probably see more bytes that actually need to
be escaped, but it's not really a problem.

> Could you please give a short description of your work? There are several

The line itself is kept encoded in UTF-8 to maximize compatibility.
Editing functions which work on characters are going to have to be
modified to check for multibyte characters. That's probably going to
sprinkle changes in many places.

The code in zle_refresh.c builds an image of the current lines being
edited for transfer onto the terminal. Because this code counts
character positions a lot to calculate where on the line updates have
to happen, where to move the cursor to, and so on, I decided that
this code needed to continue to have a fixed length code to work
with. Therefore the characters to be placed onto the screen are
kept in an array of 8 bit characters (as before) or 32 bit characters
(basically UCS-4) depending on whether Unicode mode is turned on.
I skip a slot in the array for double width characters.
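
A rough sketch of that layout, under stated assumptions: the names
fill_cells, guess_width, and WIDE_PAD are hypothetical, and the width
guess is a coarse range check over a few East Asian blocks, not a
complete table. It shows one 32-bit cell per screen column, with a
placeholder written into the slot skipped after a double width
character.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WIDE_PAD 0xffffffffu    /* hypothetical marker for the skipped slot */

/* Coarse guess: treat a few East Asian ranges as double width. */
static int guess_width(uint32_t cp)
{
    if ((cp >= 0x1100 && cp <= 0x115f) ||   /* Hangul Jamo initial consonants */
        (cp >= 0x2e80 && cp <= 0xa4cf) ||   /* CJK radicals through Yi */
        (cp >= 0xac00 && cp <= 0xd7a3) ||   /* Hangul syllables */
        (cp >= 0xf900 && cp <= 0xfaff) ||   /* CJK compatibility ideographs */
        (cp >= 0xff00 && cp <= 0xff60) ||   /* fullwidth forms */
        (cp >= 0xffe0 && cp <= 0xffe6))
        return 2;
    return 1;
}

/*
 * Lay code points out into screen cells, one cell per column.
 * Stops when the next character would not fit.
 * Returns the number of cells used.
 */
static size_t fill_cells(const uint32_t *cps, size_t ncps,
                         uint32_t *cells, size_t ncells)
{
    size_t i, col = 0;

    assert(cps != NULL && cells != NULL);
    for (i = 0; i < ncps; i++) {
        int w = guess_width(cps[i]);

        if (col + (size_t)w > ncells)
            break;
        cells[col++] = cps[i];
        if (w == 2)
            cells[col++] = WIDE_PAD;    /* second column of a wide character */
    }
    return col;
}
```

Because every column has exactly one slot, the index into the array is
also the cursor column, which is what keeps the position arithmetic in
the refresh code unchanged.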

> problems associated with multibyte locales, and switching to UTF does not
> magically solve all of them (strictly speaking it solves none without
> further work).

However, it does ease the transition. For code that modifies a string,
the worst that non-UTF-8-aware code can do is corrupt the individual
character(s) it plays with, and for code reading or outputting a
string, the worst that non-UTF-8-aware code can do is truncate
badly or miscalculate either the length of the string in
characters or the display width.
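
To illustrate the length miscalculation, here is a small sketch that
counts characters in a UTF-8 string by skipping continuation bytes. The
name utf8_char_count is hypothetical, and the function assumes
well-formed input.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Count the code points in a NUL-terminated UTF-8 string.
 * Continuation bytes have the form 10xxxxxx and are not counted.
 * Assumes the input is well-formed UTF-8.
 */
static size_t utf8_char_count(const char *s)
{
    size_t count = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xc0) != 0x80)
            count++;
    }
    return count;
}
```

For the two-character string "\xe6\x97\xa5\xe6\x9c\xac", strlen()
reports 6 while utf8_char_count() reports 2. Code that treats the byte
count as the character count is off by that difference, but the bytes
themselves pass through intact.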

> P.S. Do you have any links handy for i18n in other shells (bash or any
> other) or other text-processing programs? I had some once but seem to
> have lost them.

As for bash, I am able to type non-ASCII UTF-8 characters at it, but it
doesn't work well with double width characters; I think it assumes
everything is single width for its display calculations. As such, it
has served me when I wanted to type some Japanese text at the shell,
but I missed zsh.

-Phil


* Re: Unicode support in Zle
  2003-04-30 20:41   ` Phillip Vandry
@ 2003-05-07 10:08     ` Oliver Kiddle
  2003-05-07 10:45       ` Peter Stephenson
  2003-05-14 19:55       ` Phillip Vandry
  0 siblings, 2 replies; 7+ messages in thread
From: Oliver Kiddle @ 2003-05-07 10:08 UTC (permalink / raw)
  To: Phillip Vandry; +Cc: zsh-workers

On 30 Apr, Phillip Vandry wrote:

> One difference between what's suggested there and what I am doing is that
> I chose not to use the libc/locale functions such as wcwidth() and
> mblen(). It is debatable whether I should have, but I did this for a
> couple of reasons:

The main advantage of the libc functions is that they work for other
multi-byte encodings than utf-8. They also do a lot of work for you but
don't let that stop you reproducing it if you want.

> - To enable the functionality to work on systems where Unicode is not
> handled at all in the system's libc & associated libraries. I still

On the basis that such systems won't have utf-8 handling xterms,
filesystems or anything else, I'm sceptical about the value of that.

> use lots of older systems that run things like Solaris 2.5.1. These
> wouldn't be able to support it if I depended on the libraries. I
> will use the locale information from the environment as a hint to
> turn on UTF-8 mode, but you can also do it manually (currently by
> typing "setopt utf8"). The alternative to using libc functions is
> to use glib functions, but I don't really want to add glib to the soup.

I'd agree that adding glib into the mix would not be what we want. I'm
not sure that a utf8 option achieves anything. Assigning to LC_CTYPE
ought to be sufficient.
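
As a sketch of what relying on LC_CTYPE could look like, assuming a
POSIX system with setlocale() and nl_langinfo(): the helper names
codeset_is_utf8 and locale_wants_utf8 are hypothetical, and accepting
both the "UTF-8" and "UTF8" spellings is an assumption about how
platforms name the codeset.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>
#include <strings.h>

/* Return 1 if the codeset name denotes UTF-8, ignoring case. */
static int codeset_is_utf8(const char *codeset)
{
    assert(codeset != NULL);
    return strcasecmp(codeset, "UTF-8") == 0 ||
           strcasecmp(codeset, "UTF8") == 0;
}

/*
 * Adopt the locale from the environment (LC_ALL, LC_CTYPE, LANG) and
 * report whether its character encoding is UTF-8.
 */
static int locale_wants_utf8(void)
{
    if (setlocale(LC_CTYPE, "") == NULL)
        return 0;               /* locale not supported by this libc */
    return codeset_is_utf8(nl_langinfo(CODESET));
}
```

The result of locale_wants_utf8() depends on the environment the shell
is started in, so no fixed output is given here.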

I'd add a --disable-multibyte option to configure to cut out support
though.

> - To convince myself that overlong UTF-8 encodings are handled
> securely to my satisfaction. Encoding a character in UTF-8
> with an overlong encoding can be a security problem (example:
> software attempts to purify filenames by stripping slashes and other
> special characters but misses [0xc0 0xaf], an overlong encoding of
> the slash character in UTF-8).

Would zsh code actually do the encoding anywhere as opposed to getting
it from the terminal or wherever else? I can't particularly think of an
example where an encoding wouldn't have come from an input somewhere.

> - Both the function to calculate the length in bytes of a UTF-8 character
> and its Unicode value and the function to guess whether a character
> occupies a double width cell are easy enough to implement in under
> 30 lines of code each.

Do these functions map fairly closely onto the libc equivalents? We
could perhaps apply them on systems where configure doesn't find
functions like wctomb in libc? So systems with wctomb and friends would
get a little less bloat and support for other multi-byte encodings.

Besides these comments, this all sounds very good. I look forward to
hearing about further progress.

Oliver

PS. Just in case it is any use to you, I've attached UCS4 to UTF-8
conversion code which I meant to put into the \u/\U code as a fallback
for systems like Solaris 8. I had to do a bit of searching to find
examples of this that were not GPL'd.

#  if defined(HAVE_NL_LANGINFO) && defined(CODESET)
		if (!strcmp(nl_langinfo(CODESET), "UTF-8")) {
		    int len;

		    /* Number of bytes needed to encode wval in UTF-8. */
		    if (wval < 0x80)
			len = 1;
		    else if (wval < 0x800)
			len = 2;
		    else if (wval < 0x10000)
			len = 3;
		    else if (wval < 0x200000)
			len = 4;
		    else if (wval < 0x4000000)
			len = 5;
		    else
			len = 6;

		    /*
		     * Fill in the continuation bytes from the end, six bits
		     * at a time. Each case falls through to the next one,
		     * except the last.
		     */
		    switch (len) {
		    case 6: t[5] = (wval & 0x3f) | 0x80; wval >>= 6;
		    case 5: t[4] = (wval & 0x3f) | 0x80; wval >>= 6;
		    case 4: t[3] = (wval & 0x3f) | 0x80; wval >>= 6;
		    case 3: t[2] = (wval & 0x3f) | 0x80; wval >>= 6;
		    case 2: t[1] = (wval & 0x3f) | 0x80; wval >>= 6;
			/* Lead byte: len high bits set, then the rest of wval. */
			*t = wval | ((0xfc << (6 - len)) & 0xfc);
			break;
		    case 1: *t = wval;
		    }
		    t += len;
		    continue;
		}
#  endif


* Re: Unicode support in Zle
  2003-05-07 10:08     ` Oliver Kiddle
@ 2003-05-07 10:45       ` Peter Stephenson
  2003-05-14 19:55       ` Phillip Vandry
  1 sibling, 0 replies; 7+ messages in thread
From: Peter Stephenson @ 2003-05-07 10:45 UTC (permalink / raw)
  To: zsh-workers

Oliver Kiddle wrote:
> I'm not sure that a utf8 option achieves anything. Assigning to
> LC_CTYPE ought to be sufficient.

It does give the user a way of testing whether zsh has really gone into
UTF8 mode, however --- otherwise, how do you know if --disable-multibyte
wasn't used for compilation owing to a buggy library, or whatever?
Hence I was thinking of the same thing.  (Whether it's a good idea to
have it override the internal check for UTF8 is quite another question.)

-- 
Peter Stephenson <pws@csr.com>                  Software Engineer
CSR Ltd., Science Park, Milton Road,
Cambridge, CB4 0WH, UK                          Tel: +44 (0)1223 692070



* Re: Unicode support in Zle
  2003-05-07 10:08     ` Oliver Kiddle
  2003-05-07 10:45       ` Peter Stephenson
@ 2003-05-14 19:55       ` Phillip Vandry
  1 sibling, 0 replies; 7+ messages in thread
From: Phillip Vandry @ 2003-05-14 19:55 UTC (permalink / raw)
  To: Oliver Kiddle; +Cc: zsh-workers

On Wed, May 07, 2003 at 12:08:17PM +0200, Oliver Kiddle wrote:
> The main advantage of the libc functions is that they work for other
> multi-byte encodings than utf-8. They also do a lot of work for you but
> don't let that stop you reproducing it if you want.

You're right. Actually I believe I'm going to switch to the libc functions,
with compatibility functions conditionally compiled if they're not
available.
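
One way such conditional compilation might be arranged, as a sketch
only: HAVE_MBRLEN stands in for whatever macro configure would define
and is hypothetical here, as is the name char_len. The fallback handles
UTF-8 only and classifies the character by its lead byte.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

#ifdef HAVE_MBRLEN
/* libc version: works for whatever multibyte encoding the locale uses. */
static size_t char_len(const char *s, size_t n)
{
    mbstate_t st;
    size_t r;

    assert(s != NULL);
    memset(&st, 0, sizeof st);
    r = mbrlen(s, n, &st);
    if (r == (size_t)-1 || r == (size_t)-2)
        return 0;               /* invalid or incomplete character */
    return r == 0 ? 1 : r;      /* a NUL byte still occupies one byte */
}
#else
/* Compatibility version: UTF-8 only, length taken from the lead byte. */
static size_t char_len(const char *s, size_t n)
{
    unsigned char c;
    size_t len, i;

    assert(s != NULL);
    if (n == 0)
        return 0;
    c = (unsigned char)s[0];
    if (c < 0x80)
        len = 1;
    else if ((c & 0xe0) == 0xc0)
        len = 2;
    else if ((c & 0xf0) == 0xe0)
        len = 3;
    else if ((c & 0xf8) == 0xf0)
        len = 4;
    else
        return 0;               /* not a valid lead byte */
    if (len > n)
        return 0;               /* incomplete character */
    for (i = 1; i < len; i++)
        if (((unsigned char)s[i] & 0xc0) != 0x80)
            return 0;           /* not a continuation byte */
    return len;
}
#endif
```

Callers use char_len() either way, so systems whose libc has the
function get support for other multibyte encodings, and older systems
get UTF-8 only.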

> > - To enable the functionality to work on systems where Unicode is not
> > handled at all in the system's libc & associated libraries. I still
> 
> On the basis that such systems won't have utf-8 handling xterms,
> filesystems or anything else, I'm sceptical about the value of that.

I am almost always remotely logged in to such systems, with my xterm &
fonts available on the local machine. So there is value.

> Besides these comments, this all sounds very good. I look forward to
> hearing about further progress.

The biggest issue I've run into is [re]drawing the command line on the
screen if wide characters are used. Zsh tries to use termcap tricks like
deleting and inserting characters, but I found that different terminals
react to these commands in different ways for wide characters. For
example in xterm you need to delete two characters to delete a wide
character. In mlterm deleting a single character if the cursor is on a
wide character deletes it, and shifts the text by two cells. Currently
I deal with this unpredictability by suppressing the delete character
functionality if there are wide characters. I can't wait to find out
how different terminals handle combining characters!
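
A sketch of that workaround, with hypothetical names throughout
(WIDE_PAD, line_has_wide_cells, choose_redraw). It assumes the screen
image stores a placeholder in the slot after each double width
character, as described earlier in the thread, and simply falls back to
rewriting the line whenever such a slot is present.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WIDE_PAD 0xffffffffu    /* hypothetical marker after a wide character */

enum redraw_method {
    REDRAW_DELETE_CHAR,         /* use the terminal's delete-character capability */
    REDRAW_REWRITE              /* rewrite the rest of the line instead */
};

/* Return 1 if any cell in the line is the second half of a wide character. */
static int line_has_wide_cells(const uint32_t *cells, size_t ncells)
{
    size_t i;

    assert(cells != NULL);
    for (i = 0; i < ncells; i++)
        if (cells[i] == WIDE_PAD)
            return 1;
    return 0;
}

/* Terminals disagree on deleting wide characters, so avoid it for them. */
static enum redraw_method choose_redraw(const uint32_t *cells, size_t ncells)
{
    return line_has_wide_cells(cells, ncells) ? REDRAW_REWRITE
                                              : REDRAW_DELETE_CHAR;
}
```

Rewriting costs more output than a delete, but it does not depend on
how a particular terminal shifts text around a wide character.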

-Phil
