* Unicode support in Zle
From: Phillip Vandry @ 2003-04-29 19:43 UTC
To: zsh-workers

Zsh workers,

I would like to find out whether anyone is working on support for
typing and editing Unicode characters in the ZLE (using UTF-8).
I looked at the most recent development versions I could see and
I didn't notice anything.

Because I wanted the functionality I have already delved into the code
to determine the feasibility of doing it without breaking anything.
I will probably work what I've done already into a patch if indeed
nobody is already doing this. I expect a first cut of it will be an
approximately 1200 line patch, mostly to zle_main.c and zle_refresh.c.

Thanks for your input.

-Phil

* RE: Unicode support in Zle
From: Borzenkov Andrey @ 2003-04-30 5:00 UTC
To: 'Phillip Vandry', zsh-workers

> I would like to find out whether anyone is working on support for
> typing and editing Unicode characters in the ZLE (using UTF-8).
> I looked at the most recent development versions I could see and
> I didn't notice anything.
>
> Because I wanted the functionality I have already delved into the code
> to determine the feasibility of doing it without breaking anything.
> I will probably work what I've done already into a patch if indeed
> nobody is already doing this. I expect a first cut of it will be an
> approximately 1200 line patch, mostly to zle_main.c and zle_refresh.c.

I would be happy to join you. Mandrake 9.1 defaulted to UTF-8 on update
and this was immediately visible with Zsh :)

Could you please give a short description of your work? There are several
problems associated with multibyte locales and switching to UTF does not
magically solve all of them (strictly speaking it solves none without
further work).

Thank you very much for your effort.

-andrey

P.S. Do you have any links handy for i18n in other shells (bash or any
other) or other text-processing programs? I had them once but seem to
have lost them.

* RE: Unicode support in Zle
From: Borzenkov Andrey @ 2003-04-30 6:14 UTC
To: 'Phillip Vandry', zsh-workers

> P.S. Do you have any links handy for i18n in other shells (bash or any
> other) or other text-processing programs? I had them once but seem to
> have lost them.

BTW, take a look at this thread:

http://www.zsh.org/mla/workers/2002/msg01165.html

* Re: Unicode support in Zle
From: Phillip Vandry @ 2003-04-30 20:41 UTC
To: Borzenkov Andrey; +Cc: zsh-workers

On Wed, Apr 30, 2003 at 09:00:15AM +0400, Borzenkov Andrey wrote:
> BTW, take a look at this thread:
> http://www.zsh.org/mla/workers/2002/msg01165.html

I read the thread you pointed to in your subsequent message; it is good
background.

One difference between what's suggested there and what I am doing is that
I chose not to use the libc/locale functions such as wcwidth() and
mblen(). It is debatable whether I should have, but I did this for a
couple of reasons:

- To enable the functionality to work on systems where Unicode is not
  handled at all in the system's libc & associated libraries. I still
  use lots of older systems that run things like Solaris 2.5.1. These
  wouldn't be able to support it if I depended on the libraries. I
  will use the locale information from the environment as a hint to
  turn on UTF-8 mode, but you can also do it manually (currently by
  typing "setopt utf8"). The alternative to using libc functions is
  to use glib functions, but I don't really want to add glib to the soup.

- To convince myself that overlong UTF-8 encodings are
  handled securely to my satisfaction. Encoding a character in UTF-8
  with an overlong encoding can be a security problem (example:
  software attempts to purify filenames by stripping slashes and other
  special characters but misses [0xc0 0xaf], an overlong encoding of
  the slash character in UTF-8).

- Both the function to calculate the length in bytes of a UTF-8 character
  and its Unicode value and the function to guess whether a character
  occupies a double width cell are easy enough to implement in under
  30 lines of code each.
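To make the overlong-encoding point concrete, here is a minimal sketch of a
strict decoder in C. It is not taken from the patch discussed in this thread:
the name utf8_decode_strict is hypothetical, and the choice to accept only
sequences of up to four bytes (code points up to U+10FFFF) is an assumption
made for illustration. It returns the byte length and the code point in one
call and refuses [0xc0 0xaf].

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Decode one UTF-8 sequence from s, where at most n bytes are available.
 * Returns the number of bytes consumed (1..4) and stores the code point
 * in *cp. Returns 0 if the sequence is truncated, malformed, overlong,
 * a UTF-16 surrogate, or above U+10FFFF.
 * (utf8_decode_strict is a hypothetical name, not a zsh function.)
 */
size_t utf8_decode_strict(const unsigned char *s, size_t n, uint32_t *cp)
{
    /* Smallest code point that legitimately needs 1, 2, 3 or 4 bytes. */
    static const uint32_t min_for_len[5] = { 0, 0x0, 0x80, 0x800, 0x10000 };
    size_t len, i;
    uint32_t v;

    assert(s != NULL && cp != NULL);
    if (n == 0)
        return 0;

    /* The lead byte determines the sequence length and the top bits. */
    if (s[0] < 0x80) {
        len = 1; v = s[0];
    } else if ((s[0] & 0xe0) == 0xc0) {
        len = 2; v = s[0] & 0x1f;
    } else if ((s[0] & 0xf0) == 0xe0) {
        len = 3; v = s[0] & 0x0f;
    } else if ((s[0] & 0xf8) == 0xf0) {
        len = 4; v = s[0] & 0x07;
    } else {
        return 0;               /* stray continuation or invalid lead byte */
    }

    if (len > n)
        return 0;               /* truncated sequence */

    /* Every following byte must look like 10xxxxxx. */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xc0) != 0x80)
            return 0;
        v = (v << 6) | (s[i] & 0x3f);
    }

    if (v < min_for_len[len])
        return 0;               /* overlong, e.g. 0xc0 0xaf for '/' */
    if (v >= 0xd800 && v <= 0xdfff)
        return 0;               /* surrogate range is not a character */
    if (v > 0x10ffff)
        return 0;               /* beyond the Unicode code space */

    *cp = v;
    return len;
}
```

A caller scanning a filename would treat a return value of 0 as an invalid
byte instead of skipping ahead, so an overlong slash can never pass as an
ordinary character.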

Also, as a comment on this thread, I agree that the Meta system will
continue to work unchanged. The characters which need metafication were
chosen so that they would not be likely to occur in normal text. That's
true for ASCII and ISO-8859-x. It's not so true for UTF-8, so we will
probably see more bytes that actually need to be escaped, but it's not
really a problem.

> Could you please give a short description of your work? There are several

The line itself is kept encoded in UTF-8 to maximize compatibility.
Editing functions which work on characters are going to have to be
modified to check for multibyte characters. That's probably going to
sprinkle changes in many places.

The code in zle_refresh.c builds an image of the current lines being
edited for transfer onto the terminal. Because this code counts
character positions a lot to calculate where on the line updates have to
happen, where to move the cursor to, and so on, I decided that this code
needed to continue to have a fixed-length code to work with. Therefore
the characters to be placed onto the screen are kept in an array of
8-bit characters (as before) or 32-bit characters (basically UCS-4),
depending on whether Unicode mode is turned on. I skip a slot in the
array for double-width characters.

> problems associated with multibyte locales and switching to UTF does not
> magically solve all of them (strictly speaking it solves none without
> further work).

However, it does ease a transition. For code that modifies a string, the
worst that non-UTF-8-aware code can do is corrupt the individual
character(s) it plays with, and for code reading or outputting a string,
the worst that non-UTF-8-aware code can do is truncate badly or calculate
incorrectly either the length of the string in characters or the display
width.

> P.S. Do you have any links handy for i18n in other shells (bash or any
> other) or other text-processing programs? I had them once but seem to
> have lost them.

As for bash, I am able to type non-ASCII UTF-8 characters at it, but it
doesn't work great with double-width characters; I think it assumes
everything is single width for its display calculations. As such, it has
served me when I wanted to type some Japanese text at the shell, but I
missed zsh.

-Phil

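The screen-image scheme described in the message above (one fixed-size
32-bit slot per terminal cell, with a slot skipped after each double-width
character) could be sketched roughly as follows. This is an illustration
only, not code from zle_refresh.c: the names guess_width, fill_cells and
WIDE_PAD are hypothetical, the width table is a coarse approximation of the
East Asian wide ranges, and combining (zero-width) characters are ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical marker stored in the slot skipped after a wide character. */
#define WIDE_PAD 0xffffffffu

/*
 * Rough guess at the display width of a code point: 2 for a handful of
 * East Asian wide ranges, 1 for everything else. This table is an
 * assumption for the sketch and is far from complete.
 */
int guess_width(uint32_t cp)
{
    if ((cp >= 0x1100 && cp <= 0x115f) ||                 /* Hangul Jamo */
        (cp >= 0x2e80 && cp <= 0xa4cf && cp != 0x303f) || /* CJK, kana */
        (cp >= 0xac00 && cp <= 0xd7a3) ||                 /* Hangul syllables */
        (cp >= 0xf900 && cp <= 0xfaff) ||                 /* CJK compatibility */
        (cp >= 0xff00 && cp <= 0xff60) ||                 /* fullwidth forms */
        (cp >= 0xffe0 && cp <= 0xffe6))
        return 2;
    return 1;
}

/*
 * Lay out ncps code points into cells[], one array slot per terminal
 * column. A double-width character occupies its own slot plus a WIDE_PAD
 * slot, so array index and screen column stay identical.
 * Returns the number of cells used; stops before a character that would
 * not fit in the remaining space.
 */
size_t fill_cells(const uint32_t *cps, size_t ncps,
                  uint32_t *cells, size_t ncells)
{
    size_t col = 0, i;

    assert(cps != NULL && cells != NULL);
    for (i = 0; i < ncps; i++) {
        int w = guess_width(cps[i]);

        if (col + (size_t)w > ncells)
            break;
        cells[col++] = cps[i];
        if (w == 2)
            cells[col++] = WIDE_PAD;
    }
    return col;
}
```

Because every slot corresponds to exactly one column, comparing an old and a
new image to find where the line changed can stay a plain element-by-element
comparison, which is the property the fixed-length representation is meant to
preserve.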
* Re: Unicode support in Zle
From: Oliver Kiddle @ 2003-05-07 10:08 UTC
To: Phillip Vandry; +Cc: zsh-workers

On 30 Apr, Phillip Vandry wrote:
> One difference between what's suggested there and what I am doing is that
> I chose not to use the libc/locale functions such as wcwidth() and
> mblen(). It is debatable whether I should have, but I did this for a
> couple of reasons:

The main advantage of the libc functions is that they work for other
multi-byte encodings than utf-8. They also do a lot of work for you but
don't let that stop you reproducing it if you want.

> - To enable the functionality to work on systems where Unicode is not
>   handled at all in the system's libc & associated libraries. I still

On the basis that such systems won't have utf-8 handling xterms,
filesystems or anything else, I'm sceptical about the value of that.

>   use lots of older systems that run things like Solaris 2.5.1. These
>   wouldn't be able to support it if I depended on the libraries. I
>   will use the locale information from the environment as a hint to
>   turn on UTF-8 mode, but you can also do it manually (currently by
>   typing "setopt utf8"). The alternative to using libc functions is
>   to use glib functions, but I don't really want to add glib to the soup.

I'd agree that adding glib into the mix would not be what we want.

I'm not sure that a utf8 option achieves anything. Assigning to
LC_CTYPE ought to be sufficient. I'd add a --disable-multibyte option to
configure to cut out support though.

> - To convince myself that overlong UTF-8 encodings are
>   handled securely to my satisfaction.
>   Encoding a character in UTF-8
>   with an overlong encoding can be a security problem (example:
>   software attempts to purify filenames by stripping slashes and other
>   special characters but misses [0xc0 0xaf], an overlong encoding of
>   the slash character in UTF-8).

Would zsh code actually do the encoding anywhere, as opposed to getting it
from the terminal or wherever else? I can't think of an example where an
encoding wouldn't have come from an input somewhere.

> - Both the function to calculate the length in bytes of a UTF-8 character
>   and its Unicode value and the function to guess whether a character
>   occupies a double width cell are easy enough to implement in under
>   30 lines of code each.

Do these functions map fairly closely onto the libc equivalents? We
could perhaps apply them on systems where configure doesn't find
functions like wctomb in libc? So systems with wctomb and friends would
get a little less bloat and support for other multi-byte encodings.

Besides these comments, this all sounds very good. I look forward to
hearing about further progress.

Oliver

PS. Just in case it is any use to you, I've attached UCS4 to UTF-8
conversion code which I meant to put into the \u/\U code as a fallback
for systems like Solaris 8. I had to do a bit of searching to find
examples of this that were not GPL'd.

# if defined(HAVE_NL_LANGINFO) && defined(CODESET)
                if (!strcmp(nl_langinfo(CODESET), "UTF-8")) {
                    int len;

                    if (wval < 0x80)
                        len = 1;
                    else if (wval < 0x800)
                        len = 2;
                    else if (wval < 0x10000)
                        len = 3;
                    else if (wval < 0x200000)
                        len = 4;
                    else if (wval < 0x4000000)
                        len = 5;
                    else
                        len = 6;

                    switch (len) { /* falls through except to the last case */
                    case 6: t[5] = (wval & 0x3f) | 0x80; wval >>= 6;
                    case 5: t[4] = (wval & 0x3f) | 0x80; wval >>= 6;
                    case 4: t[3] = (wval & 0x3f) | 0x80; wval >>= 6;
                    case 3: t[2] = (wval & 0x3f) | 0x80; wval >>= 6;
                    case 2: t[1] = (wval & 0x3f) | 0x80; wval >>= 6;
                        *t = wval | ((0xfc << (6 - len)) & 0xfc);
                        break;
                    case 1: *t = wval;
                    }
                    t += len;
                    continue;
                }
# endif

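The fragment above depends on surrounding variables (wval, t) and a loop that
are not shown in the message. As a hedged illustration, the same technique
can be wrapped in a standalone function that is easy to check by hand. The
name ucs4_to_utf8 and its signature are assumptions for this sketch and do
not come from the zsh sources; like the original, it uses the older
six-byte form of UTF-8 for values up to 0x7fffffff.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Encode the UCS-4 value wval as UTF-8 into t, which must have room for
 * 6 bytes. Returns the number of bytes written (1..6).
 * wval must be below 0x80000000, the limit of the six-byte scheme.
 * (ucs4_to_utf8 is a hypothetical name used for this sketch.)
 */
size_t ucs4_to_utf8(uint32_t wval, unsigned char *t)
{
    size_t len, i;

    assert(t != NULL);
    assert(wval < 0x80000000u);

    if (wval < 0x80) {
        t[0] = (unsigned char)wval;     /* plain ASCII, no lead-byte marker */
        return 1;
    } else if (wval < 0x800) {
        len = 2;
    } else if (wval < 0x10000) {
        len = 3;
    } else if (wval < 0x200000) {
        len = 4;
    } else if (wval < 0x4000000) {
        len = 5;
    } else {
        len = 6;
    }

    /* Fill continuation bytes from the end, six payload bits at a time. */
    for (i = len - 1; i > 0; i--) {
        t[i] = (unsigned char)((wval & 0x3f) | 0x80);
        wval >>= 6;
    }

    /*
     * The lead byte carries the remaining bits plus a run of 1 bits whose
     * count equals len: 0xc0, 0xe0, 0xf0, 0xf8, 0xfc for len 2..6.
     */
    t[0] = (unsigned char)(wval | ((0xfcu << (6 - len)) & 0xfcu));
    return len;
}
```

For example, U+00E9 comes out as the two bytes 0xc3 0xa9 and U+3042 as
0xe3 0x81 0x82. The loop replaces the fall-through switch of the original
but writes the same bytes in the same order.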
* Re: Unicode support in Zle
From: Peter Stephenson @ 2003-05-07 10:45 UTC
To: zsh-workers

Oliver Kiddle wrote:
> I'm not sure that a utf8 option achieves anything. Assigning to
> LC_CTYPE ought to be sufficient.

It does give the user a way of testing whether zsh has really gone into
UTF8 mode, however --- otherwise, how do you know if --disable-multibyte
wasn't used for compilation owing to a buggy library, or whatever?
Hence I was thinking of the same thing. (Whether it's a good idea to
have it override the internal check for UTF8 is quite another question.)

-- 
Peter Stephenson <pws@csr.com>                  Software Engineer
CSR Ltd., Science Park, Milton Road,
Cambridge, CB4 0WH, UK                          Tel: +44 (0)1223 692070

* Re: Unicode support in Zle
From: Phillip Vandry @ 2003-05-14 19:55 UTC
To: Oliver Kiddle; +Cc: zsh-workers

On Wed, May 07, 2003 at 12:08:17PM +0200, Oliver Kiddle wrote:
> The main advantage of the libc functions is that they work for other
> multi-byte encodings than utf-8. They also do a lot of work for you but
> don't let that stop you reproducing it if you want.

You're right. Actually I believe I'm going to switch to the libc
functions, with compatibility functions conditionally compiled if
they're not available.

> > - To enable the functionality to work on systems where Unicode is not
> >   handled at all in the system's libc & associated libraries. I still
>
> On the basis that such systems won't have utf-8 handling xterms,
> filesystems or anything else, I'm sceptical about the value of that.

I am almost always remotely logged in to such systems, with my xterm &
fonts available on the local machine. So there is value.

> Besides these comments, this all sounds very good. I look forward to
> hearing about further progress.

The biggest issue I've run into is [re]drawing the command line on the
screen if wide characters are used. Zsh tries to use termcap tricks like
deleting and inserting characters, but I found that different terminals
react to these commands in different ways for wide characters. For
example, in xterm you need to delete two characters to delete a wide
character. In mlterm, deleting a single character when the cursor is on a
wide character deletes it and shifts the text by two cells. Currently I
deal with this unpredictability by suppressing the delete character
functionality if there are wide characters.

I can't wait to find out how different terminals handle combining
characters!

-Phil

end of thread

Thread overview: 7+ messages
2003-04-29 19:43 Unicode support in Zle Phillip Vandry
2003-04-30  5:00 ` Borzenkov Andrey
2003-04-30  6:14   ` Borzenkov Andrey
2003-04-30 20:41   ` Phillip Vandry
2003-05-07 10:08     ` Oliver Kiddle
2003-05-07 10:45       ` Peter Stephenson
2003-05-14 19:55       ` Phillip Vandry