zsh-workers
* multibyte backwarddeletechar
@ 2001-10-21 15:42 Clint Adams
  2001-10-21 17:13 ` Bart Schaefer
  2001-10-22  0:57 ` Geoff Wing
  0 siblings, 2 replies; 11+ messages in thread
From: Clint Adams @ 2001-10-21 15:42 UTC (permalink / raw)
  To: zsh-workers

I don't intend to commit this patch as-is.

This causes backward-delete-char to delete entire
multibyte characters (valid in the current locale)
rather than their component octets.

For consistency's sake, similar hacks would apply to

backward-char
backward-delete-char
delete-char
forward-char
transpose-chars
vi-find-next-char
vi-backward-char
vi-backward-delete-char
vi-forward-char

Should these be replacements or a set of new widgets?

Index: zshconfig.ac
===================================================================
RCS file: /cvsroot/zsh/zsh/zshconfig.ac,v
retrieving revision 1.20
diff -u -r1.20 zshconfig.ac
--- zshconfig.ac	2001/10/10 16:02:24	1.20
+++ zshconfig.ac	2001/10/21 15:23:42
@@ -476,7 +476,7 @@
 		 limits.h fcntl.h libc.h sys/utsname.h sys/resource.h \
 		 locale.h errno.h stdlib.h unistd.h sys/capability.h \
 		 utmp.h utmpx.h sys/types.h pwd.h grp.h poll.h sys/mman.h \
-		 netinet/in_systm.h pcre.h)
+		 netinet/in_systm.h pcre.h wchar.h)
 if test $dynamic = yes; then
   AC_CHECK_HEADERS(dlfcn.h)
   AC_CHECK_HEADERS(dl.h)
@@ -938,7 +938,8 @@
 	       brk sbrk \
 	       pathconf sysconf \
 	       tgetent tigetflag tigetnum tigetstr setupterm \
-	       pcre_compile pcre_study pcre_exec)
+	       pcre_compile pcre_study pcre_exec \
+               mbtowc)
 AC_FUNC_STRCOLL
 
 dnl  Check if tgetent accepts NULL (and will allocate its own termcap buffer)
Index: Src/Zle/zle_misc.c
===================================================================
RCS file: /cvsroot/zsh/zsh/Src/Zle/zle_misc.c,v
retrieving revision 1.6
diff -u -r1.6 zle_misc.c
--- Src/Zle/zle_misc.c	2001/09/03 01:39:20	1.6
+++ Src/Zle/zle_misc.c	2001/10/21 15:23:46
@@ -29,6 +29,9 @@
 
 #include "zle.mdh"
 #include "zle_misc.pro"
+#ifdef HAVE_MBTOWC
+# include <wchar.h>
+#endif
 
 /* insert a metafied string, with repetition and suffix removal */
 
@@ -105,6 +108,10 @@
 int
 backwarddeletechar(char **args)
 {
+#ifdef HAVE_MBTOWC
+    int i, j, k;
+    wchar_t pwc;
+#endif
     if (zmult < 0) {
 	int ret;
 	zmult = -zmult;
@@ -112,7 +119,19 @@
 	zmult = -zmult;
 	return ret;
     }
+#ifdef HAVE_MBTOWC
+    for(i=(zmult > cs) ? cs : zmult;i>0;i--) {
+	for(j=MB_CUR_MAX;j>0;j--) {
+	    k = mbtowc(&pwc, (char *)line+cs-j, j);
+	    if (k==j) {
+		backdel(j);
+		j = 0;
+	    }
+	}
+    }
+#else
     backdel(zmult > cs ? cs : zmult);
+#endif
     return 0;
 }
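
Two caveats about the loop above: when cs is smaller than MB_CUR_MAX,
line+cs-j points before the start of the buffer, and mbtowc() keeps
hidden conversion state between calls. A bounds-checked variant of the
inner step might look like the sketch below. It is only an illustration
under stated assumptions: prev_char_len is a hypothetical helper, not an
existing zsh function, and it keeps the same longest-match-first
heuristic as the patch, which can still guess wrong in encodings whose
trailing bytes overlap the lead-byte or ASCII ranges.

```c
#include <assert.h>
#include <stdlib.h>   /* MB_CUR_MAX */
#include <string.h>   /* memset */
#include <wchar.h>    /* mbrtowc, mbstate_t */

/*
 * Hypothetical helper (not part of zsh): return the byte length of the
 * character that ends just before offset cs in buf, as judged by the
 * current locale.  Returns 0 at the start of the buffer and falls back
 * to 1 when no candidate length decodes as exactly one character.
 */
static size_t
prev_char_len(const char *buf, size_t cs)
{
    size_t j;

    assert(buf != NULL);
    if (cs == 0)
        return 0;

    /* Never look further back than the start of the buffer. */
    j = MB_CUR_MAX;
    if (j > cs)
        j = cs;

    for (; j > 0; j--) {
        mbstate_t st;
        wchar_t wc;

        /* Fresh state for every attempt, so calls do not interfere. */
        memset(&st, 0, sizeof st);
        if (mbrtowc(&wc, buf + cs - j, j, &st) == j)
            return j;
    }
    return 1;   /* invalid or unrecognised byte: delete just that byte */
}
```

A caller would then use something like backdel(prev_char_len((char *)line, cs))
once per requested character, so that zmult counts characters rather
than octets.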
 



* Re: multibyte backwarddeletechar
  2001-10-21 15:42 multibyte backwarddeletechar Clint Adams
@ 2001-10-21 17:13 ` Bart Schaefer
  2001-10-21 18:21   ` Clint Adams
  2001-10-22  0:57 ` Geoff Wing
  1 sibling, 1 reply; 11+ messages in thread
From: Bart Schaefer @ 2001-10-21 17:13 UTC (permalink / raw)
  To: zsh-workers

On Oct 21, 11:42am, Clint Adams wrote:
}
} This causes backward-delete-char to delete entire
} multibyte characters (valid in the current locale)
} rather than their component octets.

I'm a bit surprised that this wouldn't cause significant confusion in
the ZLE display code.  How did the multi-byte character get input in
the first place?  Is it displayed as occupying one character position
on the screen, or several?  If only one, doesn't the cursor end up in
the wrong place on most word- or line-oriented motions that cross it?

If we're going to support wide and/or multi-byte characters, I think we
should Do It Right, not by pasting a zillion workarounds into individual
editor functions.

-- 
Bart Schaefer                                 Brass Lantern Enterprises
http://www.well.com/user/barts              http://www.brasslantern.com

Zsh: http://www.zsh.org | PHPerl Project: http://phperl.sourceforge.net   



* Re: multibyte backwarddeletechar
  2001-10-21 17:13 ` Bart Schaefer
@ 2001-10-21 18:21   ` Clint Adams
  0 siblings, 0 replies; 11+ messages in thread
From: Clint Adams @ 2001-10-21 18:21 UTC (permalink / raw)
  To: Bart Schaefer; +Cc: zsh-workers

> I'm a bit surprised that this wouldn't cause significant confusion in
> the ZLE display code.  How did the multi-byte character get input in
> the first place?  Is it displayed as occupying one character position
> on the screen, or several?  If only one, doesn't the cursor end up in
> the wrong place on most word- or line-oriented motions that cross it?

That depends on the terminal emulator and font.

If I run
LANG=zh_TW.Big5 crxvt -ls -fm taipei16 -fn 8x16 -km big5 ,
each BIG5 character (2 octets) appears to take up the
vertical space of one ASCII character and the horizontal
space of two ASCII characters.

If I run
LANG=zh_TW.Big5 crxvt -ls -fm taipei14 -fn 8x16 -km big5 ,
each BIG5 character (2 octets) appears to take up the
vertical space of one ASCII character and the horizontal
space of two and a half (2.5) ASCII characters, although
crxvt does some ugly overlapping, with the result that ZLE
does not get confused.

If I run LANG=ja_JP.UTF-8 xterm -class UXTerm ,
each UTF-8 Kanji character (3 octets) appears to take up
the same (2 horizontal, 1 vertical) space.  In this case,
ZLE does get horribly confused.

If I run LANG=ru_RU.UTF-8 xterm -class UXTerm ,
each UTF-8 Cyrillic character (2 octets) appears to take
up the horizontal and vertical space of one ASCII character.
This also makes ZLE horribly confused.

If I run LANG=fr_FR.UTF-8 xterm -class UXTerm ,
each UTF-8 French non-ASCII character (2 octets)
appears to take up the horizontal and vertical space of one
ASCII character.  Again, this confuses ZLE.

I imagine that 6-byte characters will generally take up
less horizontal space than 6 ASCII characters as well.

> If we're going to support wide and/or multi-byte characters, I think we
> should Do It Right, not by pasting a zillion workarounds into individual
> editor functions.

I suspect that Doing It Right involves changing char *line to
wchar_t *wline, and modifying all dependencies accordingly.
Additionally, we'd need to figure out how much space each
individual character consumes.



* Re: multibyte backwarddeletechar
  2001-10-21 15:42 multibyte backwarddeletechar Clint Adams
  2001-10-21 17:13 ` Bart Schaefer
@ 2001-10-22  0:57 ` Geoff Wing
  2001-10-22  1:50   ` Clint Adams
  2001-10-22  5:20   ` Borsenkow Andrej
  1 sibling, 2 replies; 11+ messages in thread
From: Geoff Wing @ 2001-10-22  0:57 UTC (permalink / raw)
  To: Zsh Hackers

Clint Adams <clint@zsh.org> typed:
:I don't intend to commit this patch as-is.
:This causes backward-delete-char to delete entire
:multibyte characters (valid in the current locale)
:rather than their component octets.

Except that the ZLE refresh code can't handle it.  The code
only sees single characters and won't pass multibyte characters
through to the underlying terminal properly.

Background info for everyone: terminal emulators (I say "terminal
emulators" because I don't know of any terminals which can handle
multibyte glyphs but there may be some around) need multibyte glyphs
to be passed through atomically (considering only the character stream)
otherwise the terminal emulator can't know what constitutes part of
a glyph and what doesn't.  In common multibyte languages, this means
that pairs of characters (representing one glyph two characters wide)
must be passed through in sequence.

The ZLE refresh code quite happily writes the second half of multibyte
glyphs through out of context (i.e. without the first half) which would
corrupt terminal emulator displays.

My first thought is whether it is meaningful to use multibyte glyphs
on the command line.  And it may well be if, say, people name files using
multibyte glyphs and other programs (e.g. ls) display those names.
My second is whether we truly want to handle multibyte glyphs.  I don't
think minihacks will work.  It may be a major overhaul.  Not just the ZLE
refresh code would need updating but other areas too.  Of course, it
may not be as much work as I think but would definitely need some
discussion about what should and should not be handled.

Regards,
-- 
Geoff Wing | gcw@pobox.com | gcw@rxvt.org | gcw@zsh.org



* Re: multibyte backwarddeletechar
  2001-10-22  0:57 ` Geoff Wing
@ 2001-10-22  1:50   ` Clint Adams
  2001-10-22  3:23     ` Geoff Wing
  2001-10-22  5:20   ` Borsenkow Andrej
  1 sibling, 1 reply; 11+ messages in thread
From: Clint Adams @ 2001-10-22  1:50 UTC (permalink / raw)
  To: Geoff Wing; +Cc: Zsh Hackers

> My first thought is whether it is meaningful to use multibyte glyphs
> on the command line.  And it may well be if, say, people name files using
> multibyte glyphs and other programs (e.g. ls) display those names.
> My second is whether we truly want to handle multibyte glyphs.  I don't
> think minihacks will work.  It may be a major overhaul.  Not just the ZLE
> refresh code would need updating but other areas too.  Of course, it
> may not be as much work as I think but would definitely need some
> discussion about what should and should not be handled.

For one thing, the %D escape in my prompt is substituted with
multibyte glyphs in the relevant locales.  That in itself
poses a potential width calculation problem even if the user
doesn't input any multibyte characters.

As for the command line:

--8<--snip--8<--
To delete a multi-byte character you have to press backspace multiple times.
Besides the normal irritation, this can lead to you feeding non-conformant
UTF-8 streams into programs expecting conformant UTF-8.

Hard to explain. The following is probably totally messed up, but I'm trying
to pipe the single letter 'latin small letter a with ring above', which is
0xE5 in ISO 8859-1, into od -x. The first time I get the real letter on the
command line and everything seems OK. The second time I get the right letter
on the command line and press backspace once before continuing with '|'.

plugh% echo -n å| od -x
0000000 a5c3
0000002
plugh% echo -n | od -x 
0000000 00c3
0000001
plugh% 



* Re: multibyte backwarddeletechar
  2001-10-22  1:50   ` Clint Adams
@ 2001-10-22  3:23     ` Geoff Wing
  2001-10-22 11:27       ` Clint Adams
  0 siblings, 1 reply; 11+ messages in thread
From: Geoff Wing @ 2001-10-22  3:23 UTC (permalink / raw)
  To: Zsh Hackers; +Cc: Clint Adams

Clint Adams <clint@zsh.org> typed:
:For one thing, the %D escape in my prompt is substituted with
:multibyte glyphs in the relevant locales.  That in itself
:poses a potential width calculation problem even if the user
:doesn't input any multibyte characters.

I understand a lot of the limitations, having programmed from both
ends of the system (output to a terminal and interpret such output
in a terminal emulator).  And your initial list covers the basic
areas: deleting characters and basic cursor movement.

But what happens when I do, say, "history-incremental-search-backward",
input characters for the second half of a multibyte glyph, then do
"kill-line"?  What happens when I do "down-case-word" when we only
consider byte by byte (which may easily corrupt the second byte of a
two-byte glyph)?  These are a couple off the top of my head.
There'll be quite a few more and I think they'd need to be tracked down
and considered before we make changes in this area.
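
The down-case-word concern can be made concrete.  In Big5 the second
byte of a two-byte character may fall in the ASCII range 0x40-0x7E, so
a byte-wise tolower() can alter it.  The sketch below is illustrative
only: downcase_bytes is a hypothetical stand-in, not the ZLE
implementation, and the byte pair in the note after it is just an
example whose second byte happens to equal ASCII 'H'.

```c
#include <assert.h>
#include <ctype.h>    /* tolower */
#include <stddef.h>   /* size_t */

/*
 * Hypothetical byte-wise lowercasing, in the style of an editor widget
 * that treats every octet as a character.  It has no idea whether a
 * byte is a whole character or the tail of a multibyte one.
 */
static void
downcase_bytes(unsigned char *buf, size_t len)
{
    size_t i;

    assert(buf != NULL);
    for (i = 0; i < len; i++)
        buf[i] = (unsigned char)tolower(buf[i]);
}
```

Given the pair { 0xA4, 0x48 } in the C locale, the call leaves 0xA4
alone but turns 0x48 ('H') into 0x68 ('h'), so the pair now encodes a
different character.  Working on wchar_t elements with towlower() would
avoid this, because each element is a whole character.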

I'm not trying to discourage anyone from looking at this and hope I
don't but would rather they have a broader overview so that if we
do change then we have a well planned method rather than trying to
hack each area separately.

Regards,
-- 
Geoff Wing | gcw@pobox.com | gcw@rxvt.org | gcw@zsh.org



* RE: multibyte backwarddeletechar
  2001-10-22  0:57 ` Geoff Wing
  2001-10-22  1:50   ` Clint Adams
@ 2001-10-22  5:20   ` Borsenkow Andrej
  2001-10-22 11:32     ` Clint Adams
  1 sibling, 1 reply; 11+ messages in thread
From: Borsenkow Andrej @ 2001-10-22  5:20 UTC (permalink / raw)
  To: 'Geoff Wing', 'Zsh Hackers'

> 
> My first thought is whether it is meaningful to use multibyte glyphs
> on the command line.  And it may well be if, say, people name files using
> multibyte glyphs and other programs (e.g. ls) display those names.

Yes. We may argue that it is non-portable but you cannot force people to
use ASCII only.

> My second is whether we truly want to handle multibyte glyphs.  I don't
> think minihacks will work.  It may be a major overhaul.  Not just the ZLE
> refresh code would need updating but other areas too.


Of course. The whole string handling in zsh must be rewritten. Even
globbing won't work properly any more (`?' is not expected to match more
than one byte and character classes stop working).
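
To illustrate the `?' point: a pattern matcher that wants `?' to match
one character has to ask the locale how many bytes that character
occupies.  The following is a rough sketch only; qmark_len is a
hypothetical helper, not zsh's pattern code, and treating an invalid
sequence as a single byte is an assumption made here for simplicity.

```c
#include <assert.h>
#include <string.h>   /* memset */
#include <wchar.h>    /* mbrlen, mbstate_t */

/*
 * Hypothetical helper: number of bytes a single `?' should consume at
 * s, where n bytes remain.  Invalid, incomplete or NUL input is treated
 * as one byte so that matching always makes progress.
 */
static size_t
qmark_len(const char *s, size_t n)
{
    mbstate_t st;
    size_t r;

    assert(s != NULL);
    if (n == 0)
        return 0;

    memset(&st, 0, sizeof st);
    r = mbrlen(s, n, &st);
    if (r == 0 || r == (size_t)-1 || r == (size_t)-2)
        return 1;
    return r;
}
```

In a UTF-8 locale this would return 2 for the two octets of a-ring
shown elsewhere in this thread, whereas the existing byte-oriented
matching advances by 1.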

The problem is that it does mean overhead. I am not sure about the proper
implementation. Using wchar looks portable, but the immediate problem is
that the conventional str* functions stop working. Using UTF-8 is appealing
due to its ASCII compatibility, but then you get the problem of converting
from/to the external charset (that implies reimplementing the iconv layer
for systems that do have it natively).



-andrej



* Re: multibyte backwarddeletechar
  2001-10-22  3:23     ` Geoff Wing
@ 2001-10-22 11:27       ` Clint Adams
  0 siblings, 0 replies; 11+ messages in thread
From: Clint Adams @ 2001-10-22 11:27 UTC (permalink / raw)
  To: Geoff Wing; +Cc: Zsh Hackers

> But what happens when I do, say, "history-incremental-search-backward",
> input characters for the second half of a multibyte glyph, then do
> "kill-line"?  What happens when I do "down-case-word" when we only
> consider byte by byte (which may easily corrupt the second byte of a
> two-byte glyph)?  These are a couple off the top of my head.

If the strings are being stored as wchar_t arrays, I don't think
these are problems.



* Re: multibyte backwarddeletechar
  2001-10-22  5:20   ` Borsenkow Andrej
@ 2001-10-22 11:32     ` Clint Adams
  2001-10-22 12:02       ` Borsenkow Andrej
  0 siblings, 1 reply; 11+ messages in thread
From: Clint Adams @ 2001-10-22 11:32 UTC (permalink / raw)
  To: Borsenkow Andrej; +Cc: 'Geoff Wing', 'Zsh Hackers'

> implementation. Using wchar looks portable but the immediate problem is
> that conventional str* functions stop working. Using UTF-8 is appealing

Since there are wide equivalents for most str* functions, that's not
too severe a problem.

I did try once to replace shingetline with something that called
a shingetwline (using wide equivalents) then ran it through wcstombs()
to return the char * that was wanted.  It didn't function properly;
probably something I don't understand about wide characters.



* RE: multibyte backwarddeletechar
  2001-10-22 11:32     ` Clint Adams
@ 2001-10-22 12:02       ` Borsenkow Andrej
  2001-10-24 13:57         ` Clint Adams
  0 siblings, 1 reply; 11+ messages in thread
From: Borsenkow Andrej @ 2001-10-22 12:02 UTC (permalink / raw)
  To: 'Clint Adams'; +Cc: 'Geoff Wing', 'Zsh Hackers'

> 
> > implementation. Using wchar looks portable but the immediate problem is
> > that conventional str* functions stop working. Using UTF-8 is appealing
> 
> Since there are wide equivalents for most str* functions, that's not
> too severe a problem.
>

Mmm ... yes. We also need to deal with quoting; that may work just as it
does now, either with char constants replaced by wchar_t constants (I do
not know how portable that is) or by using btowc to convert them on the
fly, which assumes the locale is upward compatible with ASCII (but we
silently assume that anyway).
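
The btowc route might be sketched like this.  It is an illustration
only: is_wide_backslash is a hypothetical predicate, not a zsh
function, and it relies on the same assumption that the locale is
ASCII-compatible.

```c
#include <assert.h>
#include <wchar.h>    /* btowc, wint_t, WEOF */

/*
 * Hypothetical predicate: is wc the backslash quoting character?
 * The single-byte constant is converted on the fly with btowc()
 * instead of being written as a wide constant.
 */
static int
is_wide_backslash(wchar_t wc)
{
    wint_t bs = btowc('\\');

    /* btowc() yields WEOF if the byte is not a complete character. */
    assert(bs != WEOF);
    return (wint_t)wc == bs;
}
```

The alternative is simply to compare against the wide constant L'\\';
the btowc form only differs in deferring the conversion to run time.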

> I did try once to replace shingetline with something that called
> a shingetwline (using wide equivalents) then ran it through wcstombs()
> to return the char * that was wanted.  It didn't function properly;
> probably something I don't understand about wide characters.

I am not sure I follow. What you actually have to do is:

- on input: either get plain characters and convert them using btowc
(that is O.K. as a starting point) or read the multibyte stream with mb*
functions and convert it with mbtowc (that is what is needed in the end
to be able to deal with UTF-8 encoding).

- on output: use either wctob or wctomb.

Looks like you did exactly the opposite :-)
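
The direction of the two conversions can be sketched with the
whole-string variants mbstowcs() and wcstombs().  This is a minimal
sketch, not proposed zsh code: mb_roundtrip is a hypothetical name,
and the locale is assumed to have been set with setlocale() beforehand.

```c
#include <assert.h>
#include <stdlib.h>   /* malloc, free, mbstowcs, wcstombs */
#include <string.h>   /* strlen */
#include <wchar.h>    /* wchar_t */

/*
 * Hypothetical helper: decode the multibyte string `in' to wide
 * characters (the input direction), then encode it back into `out'
 * (the output direction).  Returns the number of wide characters, or
 * (size_t)-1 on conversion error or if `out' is too small.
 */
static size_t
mb_roundtrip(const char *in, char *out, size_t outsz)
{
    size_t nbytes, nwide, nout;
    wchar_t *w;

    assert(in != NULL && out != NULL && outsz > 0);
    nbytes = strlen(in);

    /* A multibyte string never has more characters than bytes. */
    w = malloc((nbytes + 1) * sizeof *w);
    if (w == NULL)
        return (size_t)-1;

    nwide = mbstowcs(w, in, nbytes + 1);     /* input: bytes -> wchar_t */
    if (nwide == (size_t)-1) {
        free(w);
        return (size_t)-1;
    }

    nout = wcstombs(out, w, outsz);          /* output: wchar_t -> bytes */
    free(w);
    if (nout == (size_t)-1 || nout == outsz) /* error, or no room for NUL */
        return (size_t)-1;
    return nwide;
}
```

An editor would do its work on the wchar_t array between the two
calls, so that every element it moves over or deletes is a whole
character.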


-andrej



* Re: multibyte backwarddeletechar
  2001-10-22 12:02       ` Borsenkow Andrej
@ 2001-10-24 13:57         ` Clint Adams
  0 siblings, 0 replies; 11+ messages in thread
From: Clint Adams @ 2001-10-24 13:57 UTC (permalink / raw)
  To: Borsenkow Andrej; +Cc: 'Geoff Wing', 'Zsh Hackers'

> I am not sure I follow it. What you actually have to do is
> 
> - on input: either get plain characters and convert them using btowc
> (that is O.K. as starting point) or read multibyte stream with mb*
> functions and convert them with mbtowc (that is needed as final result
> to be able to deal with UTF-8 encoding finally).
 
> - on output: use either wctob or wctomb.
> 
> Looks like you did exactly opposite :-)

Indeed.  I suppose I expected fgetwc to be smarter than it is.

wcwidth() seems to be a good answer to the ZLE line-width
determining problem.
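
Roughly, and assuming the line is already held as wide characters,
the width calculation could be sketched as below.  line_columns is a
hypothetical helper, not a ZLE function; wcwidth() is an XSI
interface, hence the feature-test macro.

```c
#define _XOPEN_SOURCE 700   /* expose wcwidth() */
#include <assert.h>
#include <stddef.h>
#include <wchar.h>          /* wcwidth */

/*
 * Hypothetical helper: number of screen columns occupied by the wide
 * string w, or -1 if it contains a character that wcwidth() reports
 * as non-printable in the current locale.
 */
static int
line_columns(const wchar_t *w)
{
    int total = 0;

    assert(w != NULL);
    for (; *w != L'\0'; w++) {
        int cw = wcwidth(*w);

        if (cw < 0)
            return -1;
        total += cw;
    }
    return total;
}
```

Whether the result matches what a given terminal emulator and font
actually draw is a separate question, as the crxvt observations
earlier in the thread suggest.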



