Another idea on how to insert illegal multibyte characters

* Another idea on how to insert illegal multibyte characters
@ 2006-01-12  3:42 Wayne Davison
  2006-01-12  9:23 ` Peter Stephenson
  2006-01-13  1:00 ` Clint Adams
  0 siblings, 2 replies; 5+ messages in thread
From: Wayne Davison @ 2006-01-12  3:42 UTC (permalink / raw)
  To: zsh-workers

[-- Attachment #1: Type: text/plain, Size: 680 bytes --]

Here's another idea on how filenames with illegal byte sequences could
be inserted in the command line:  insert a $'\321' string for each one.
Since this idiom uses plain ASCII, it inserts into the line just fine.
It also has the advantage that zsh will interpret the sequence back into
the appropriate character.
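
To make the encoding concrete, here is a minimal sketch of turning one
byte into its $'\ooo' form.  The names octal_escape_byte and
OCTAL_ESCAPE_LEN are made up for illustration and do not appear in zsh.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Length of one escape such as $'\321', not counting the trailing NUL. */
#define OCTAL_ESCAPE_LEN 7

/* Hypothetical helper (not from zsh): write the $'\ooo' form of one byte
 * into out, which must have room for OCTAL_ESCAPE_LEN + 1 chars.
 * Returns the number of characters written, excluding the NUL. */
size_t octal_escape_byte(unsigned char byte, char *out)
{
    assert(out != NULL);
    memcpy(out, "$'\\", 3);                    /* the three chars $ ' \ */
    out[3] = (char)('0' + ((byte >> 6) & 7));  /* high octal digit */
    out[4] = (char)('0' + ((byte >> 3) & 7));  /* middle octal digit */
    out[5] = (char)('0' + (byte & 7));         /* low octal digit */
    out[6] = '\'';
    out[7] = '\0';
    return OCTAL_ESCAPE_LEN;
}
```

For example, the byte 0xD1 (octal 321) comes out as the seven ASCII
characters $'\321', which the shell reads back as that single byte.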

I created an initial patch for this.  It works to insert the necessary
letters into the command-line, but has a bug where tab completion will
not remove enough characters when moving from match to match if the
previous match had one or more expanded $'\321' sequences.  If folks
like this idea, I imagine this bug wouldn't be too hard to fix.
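
The cause of the bug is not pinned down here, but the length arithmetic
involved is simple: each escaped byte takes 7 characters on the line
instead of 1, so a match grows by 6 characters per invalid byte.  A
rough sketch, with hypothetical function names that are not part of zsh:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helpers, not part of zsh. */

/* Characters a match occupies on the command line once every invalid
 * byte has been replaced by a 7-character $'\ooo' escape. */
size_t expanded_length(size_t raw_len, size_t invalid_bytes)
{
    assert(invalid_bytes <= raw_len);
    return raw_len + 6 * invalid_bytes;
}

/* Worst case, where every byte of the input is invalid. */
size_t worst_case_length(size_t raw_len)
{
    return expanded_length(raw_len, raw_len);
}
```

Any code that removes a previous match from the line would need the
expanded figure rather than the raw byte count.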

Opinions?

..wayne..

[-- Attachment #2: another-try.diff --]
[-- Type: text/plain, Size: 3036 bytes --]

--- Src/Zle/zle_utils.c	12 Jan 2006 01:04:17 -0000	1.36
+++ Src/Zle/zle_utils.c	12 Jan 2006 03:12:34 -0000
@@ -233,8 +233,9 @@ mod_export ZLE_STRING_T
 stringaszleline(char *instr, int incs, int *outll, int *outsz, int *outcs)
 {
     ZLE_STRING_T outstr;
-    int ll, sz;
+    int ll;
 #ifdef MULTIBYTE_SUPPORT
+    int eol = 0;
     mbstate_t mbs;
 #endif
 
@@ -256,17 +257,15 @@ stringaszleline(char *instr, int incs, i
     }
     unmetafy(instr, &ll);
 
-    /*
-     * ll is the maximum number of characters there can be in
-     * the output string; the closer to ASCII the string, the
-     * better the guess.  For the 2 see above.
-     */
-    sz = (ll + 2) * ZLE_CHAR_SIZE;
+#ifdef MULTIBYTE_SUPPORT
+    /* Compute the maximum amount of memory we'll need, which takes the
+     * pessimistic view that every character in the input needs to turn
+     * into a $'\321' string in the output.  For the reason for the +2,
+     * see the function comments. */
     if (outsz)
-	*outsz = ll;
-    outstr = (ZLE_STRING_T)zalloc(sz);
+	*outsz = ll * 7;
+    outstr = (ZLE_STRING_T)zalloc((ll*7 + 2) * ZLE_CHAR_SIZE);
 
-#ifdef MULTIBYTE_SUPPORT
     if (ll) {
 	char *inptr = instr;
 	wchar_t *outptr = outstr;
@@ -275,22 +274,36 @@ stringaszleline(char *instr, int incs, i
 	memset(&mbs, '\0', sizeof mbs);
 
 	while (ll > 0) {
-	    size_t cnt = mbrtowc(outptr, inptr, ll, &mbs);
+	    size_t cnt = eol ? MB_INVALID : mbrtowc(outptr, inptr, ll, &mbs);
 
-	    /*
-	     * At this point we don't handle either incomplete (-2) or
-	     * invalid (-1) multibyte sequences.  Use the current length
-	     * and return.
-	     */
-	    if (cnt == MB_INCOMPLETE || cnt == MB_INVALID)
+	    switch (cnt) {
+	    case MB_INCOMPLETE:
+		eol = 1;
+		/* FALL THROUGH */
+	    case MB_INVALID:
+		/* Get mbs out of its undefined state. */
+		memset(&mbs, '\0', sizeof mbs);
+		/* Transform invalid character sequences into $'\321'
+		 * strings that will be converted by the shell into
+		 * the appropriate character. */
+		*outptr++ = L'$';
+		*outptr++ = L'\'';
+		*outptr++ = L'\\';
+		*outptr++ = L'0' + (STOUC(*inptr) / 0100);
+		*outptr++ = L'0' + ((STOUC(*inptr) / 010) & 07);
+		*outptr++ = L'0' + (STOUC(*inptr) & 07);
+		*outptr = L'\'';
+		cnt = 1;
 		break;
-
-	    if (cnt == 0) {
+	    case 0:
 		/* Converting '\0' returns 0, but a '\0' is a real
 		 * character for us, so we should consume 1 byte
 		 * (certainly true for Unicode and unlikely to be false
 		 * in any non-pathological multibyte representation). */
 		cnt = 1;
+		/* FALL THROUGH */
+	    default:
+		break;
 	    }
 
 	    if (outcs) {
@@ -311,7 +324,15 @@ stringaszleline(char *instr, int incs, i
 	if (outcs)
 	    *outcs = 0;
     }
-#else
+
+#else /* !MULTIBYTE_SUPPORT */
+
+    if (outsz)
+	*outsz = ll;
+    /* ll is the number of characters in the unmetafied string.  For the
+     * reason for the +2, see the function comments. */
+    outstr = (ZLE_STRING_T)zalloc(ll + 2);
+
     memcpy(outstr, instr, ll);
     *outll = ll;
     if (outcs)

* Re: Another idea on how to insert illegal multibyte characters
  2006-01-12  3:42 Another idea on how to insert illegal multibyte characters Wayne Davison
@ 2006-01-12  9:23 ` Peter Stephenson
  2006-02-11 10:19   ` Wayne Davison
  2006-01-13  1:00 ` Clint Adams
  1 sibling, 1 reply; 5+ messages in thread
From: Peter Stephenson @ 2006-01-12  9:23 UTC (permalink / raw)
  To: Zsh hackers list

Wayne Davison wrote:
> Here's another idea on how filenames with illegal byte sequences could
> be inserted in the command line:  insert a $'\321' string for each one.

That ought to work quite well, although to do it completely consistently
you'd have to worry about quoting, which is difficult at that point
inside zle.  Filenames aren't usually quoted, except using backslashes,
so this will work most of the time, but every now and then it won't.
I certainly think it's good enough for now.

The completion system is a bit more quoting aware: it knows whether or
not it needs to insert a backslash before special characters because of
quotes earlier on the line.  Ideally it should handle unprintable
characters at the same point where it tries to do that.  That doesn't
need to be done at the same time, though.  (I would hope it could be
done independently and prevent the equivalent code inside zle kicking
in.)
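
As an illustration of the quoting concern: $'...' is not expanded inside
single or double quotes, so one possible approach is to close the
enclosing quote, emit the escape, and reopen the quote.  The sketch
below assumes the caller already knows the quoting context.  The enum
and function names are hypothetical, not zsh's.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical description of the quoting in effect at the insert point. */
enum quote_ctx { QC_NONE, QC_SINGLE, QC_DOUBLE };

/* Write an escape for one byte that is safe in the given context.
 * Inside quotes the enclosing quote is closed before the escape and
 * reopened after it.  Returns the length written, or 0 if out is too
 * small. */
size_t escape_in_context(unsigned char byte, enum quote_ctx ctx,
                         char *out, size_t outsz)
{
    const char *q = ctx == QC_SINGLE ? "'" : ctx == QC_DOUBLE ? "\"" : "";
    int n;

    assert(out != NULL);
    n = snprintf(out, outsz, "%s$'\\%03o'%s", q, (unsigned)byte, q);
    return (n < 0 || (size_t)n >= outsz) ? 0 : (size_t)n;
}
```

With byte 0321 this yields $'\321' when unquoted, '$'\321'' inside
single quotes, and "$'\321'" inside double quotes.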

-- 
Peter Stephenson <pws@csr.com>                  Software Engineer
CSR PLC, Churchill House, Cambridge Business Park, Cowley Road
Cambridge, CB4 0WZ, UK                          Tel: +44 (0)1223 692070

* Re: Another idea on how to insert illegal multibyte characters
  2006-01-12  3:42 Another idea on how to insert illegal multibyte characters Wayne Davison
  2006-01-12  9:23 ` Peter Stephenson
@ 2006-01-13  1:00 ` Clint Adams
  1 sibling, 0 replies; 5+ messages in thread
From: Clint Adams @ 2006-01-13  1:00 UTC (permalink / raw)
  To: Wayne Davison; +Cc: zsh-workers

> I created an initial patch for this.  It works to insert the necessary
> letters into the command-line, but has a bug where tab completion will
> not remove enough characters when moving from match to match if the
> previous match had one or more expanded $'\321' sequences.  If folks
> like this idea, I imagine this bug wouldn't be too hard to fix.

Sounds better than the status quo.


* Re: Another idea on how to insert illegal multibyte characters
  2006-01-12  9:23 ` Peter Stephenson
@ 2006-02-11 10:19   ` Wayne Davison
  2006-02-13 10:54     ` Peter Stephenson
  0 siblings, 1 reply; 5+ messages in thread
From: Wayne Davison @ 2006-02-11 10:19 UTC (permalink / raw)
  To: Peter Stephenson; +Cc: Zsh hackers list

[-- Attachment #1: Type: text/plain, Size: 1326 bytes --]

On Thu, Jan 12, 2006 at 09:23:19AM +0000, Peter Stephenson wrote:
> The completion system is a bit more quoting aware: it knows whether or
> not it needs to insert a backslash before special characters because of
> quotes earlier on the line.  Ideally it should handle unprintable
> characters at the same point where it tries to do that.  That doesn't
> need to be done at the same time, though.  (I would hope it could be
> done independently and prevent the equivalent code inside zle kicking
> in.)

The attached patch is an alternative to my older patch that changed
stringaszleline().  This one changes add_match_data(), which means that
it is happening early enough that zsh could be made to figure out how
to insert the $'\123' sequences into single- or double-quoted strings
(though it does not yet do this).  This patch also fixes the updating
glitch that I mentioned my last patch had.
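
For readers who want to try the conversion outside zsh, here is a
standalone sketch of the same loop.  It is not the patch itself: it uses
malloc() rather than zhalloc(), ignores metafication, and the function
name escape_invalid_bytes is made up.  Which bytes count as invalid
depends on the current locale.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical standalone version of the conversion loop: copy str,
 * replacing every byte that mbrtowc() rejects with a $'\ooo' escape.
 * Returns a malloc'd NUL-terminated string, or NULL on allocation
 * failure. */
char *escape_invalid_bytes(const char *str, size_t len)
{
    mbstate_t mbs;
    const char *f = str;
    char *out, *t;
    size_t left = len;
    int eol = 0;

    assert(str != NULL);
    out = t = malloc(len * 7 + 1);   /* pessimistic: 7 chars per byte */
    if (!out)
        return NULL;
    memset(&mbs, 0, sizeof mbs);

    while (left > 0) {
        wchar_t wc;
        size_t cnt = eol ? (size_t)-1 : mbrtowc(&wc, f, left, &mbs);

        if (cnt == (size_t)-2) {     /* incomplete: escape the rest */
            eol = 1;
            cnt = (size_t)-1;
        }
        if (cnt == (size_t)-1) {     /* invalid: escape one byte */
            unsigned char b = (unsigned char)*f;
            memset(&mbs, 0, sizeof mbs);  /* leave the undefined state */
            *t++ = '$';
            *t++ = '\'';
            *t++ = '\\';
            *t++ = (char)('0' + ((b >> 6) & 7));
            *t++ = (char)('0' + ((b >> 3) & 7));
            *t++ = (char)('0' + (b & 7));
            *t++ = '\'';
            f++;
            left--;
            continue;
        }
        if (cnt == 0)                /* an embedded NUL consumes one byte */
            cnt = 1;
        memcpy(t, f, cnt);           /* valid character: copy unchanged */
        t += cnt;
        f += cnt;
        left -= cnt;
    }
    *t = '\0';
    return out;
}
```

A caller would normally run setlocale(LC_ALL, "") first so that
mbrtowc() judges the bytes against the user's character set, and would
free() the result when done.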

I think this would be good enough to include in the next release.  It
would at least make the completion of filenames with invalid charset
sequences possible, which is better than the current truncating.
Thoughts?

One caveat about my renaming of "sl" to "stl":  add_match_data() had two
variables with the same name (one more deeply nested), so I changed the
outer one (which holds the length of "str") to be "stl".

..wayne..

[-- Attachment #2: multibyte.patch --]
[-- Type: text/plain, Size: 2746 bytes --]

--- Src/Zle/compcore.c	15 Nov 2005 08:44:18 -0000	1.78
+++ Src/Zle/compcore.c	11 Feb 2006 09:44:45 -0000
@@ -2227,10 +2227,15 @@ add_match_data(int alt, char *str, char 
 	       char *psuf, Cline sline,
 	       char *suf, int flags, int exact)
 {
+#ifdef MULTIBYTE_SUPPORT
+    mbstate_t mbs;
+    char *t, *f, *new_str = NULL;
+    int fl, eol = 0;
+#endif
     Cmatch cm;
     Aminfo ai = (alt ? fainfo : ainfo);
     int palen, salen, qipl, ipl, pl, ppl, qisl, isl, psl;
-    int sl, lpl, lsl, ml;
+    int stl, lpl, lsl, ml;
 
     palen = salen = qipl = ipl = pl = ppl = qisl = isl = psl = 0;
 
@@ -2445,6 +2450,59 @@ add_match_data(int alt, char *str, char 
 	    line = p;
 	}
     }
+
+    stl = strlen(str);
+#ifdef MULTIBYTE_SUPPORT
+    /* If "str" contains a character that won't convert into a wide
+     * character, change it into a $'\123' sequence. */
+    memset(&mbs, '\0', sizeof mbs);
+    for (t = f = str, fl = stl; fl > 0; ) {
+	wchar_t wc;
+	size_t cnt = eol ? MB_INVALID : mbrtowc(&wc, f, fl, &mbs);
+	switch (cnt) {
+	case MB_INCOMPLETE:
+	    eol = 1;
+	    /* FALL THROUGH */
+	case MB_INVALID:
+	    /* Get mbs out of its undefined state. */
+	    memset(&mbs, '\0', sizeof mbs);
+	    if (!new_str) {
+		/* Be very pessimistic about how much space we'll need. */
+		new_str = zhalloc(stl*7 + 1);
+		memcpy(new_str, str, t - str);
+		t = new_str + (t - str);
+	    }
+	    *t++ = '$';
+	    *t++ = '\'';
+	    *t++ = '\\';
+	    *t++ = '0' + ((STOUC(*f) >> 6) & 7);
+	    *t++ = '0' + ((STOUC(*f) >> 3) & 7);
+	    *t++ = '0' + (STOUC(*f) & 7);
+	    *t++ = '\'';
+	    f++;
+	    fl--;
+	    break;
+	case 0:
+	    /* Converting '\0' returns 0, but a '\0' is a real
+	     * character for us, so we should consume 1 byte
+	     * (certainly true for Unicode and unlikely to be false
+	     * in any non-pathological multibyte representation). */
+	    cnt = 1;
+	    /* FALL THROUGH */
+	default:
+	    fl -= cnt;
+	    while (cnt--)
+		*t++ = *f++;
+	    break;
+	}
+    }
+    if (new_str) {
+	*t = '\0';
+	str = new_str;
+	stl = strlen(str);
+    }
+#endif
+
     /* Allocate and fill the match structure. */
     cm = (Cmatch) zhalloc(sizeof(struct cmatch));
     cm->str = str;
@@ -2539,10 +2597,9 @@ add_match_data(int alt, char *str, char 
     if (!ai->firstm)
 	ai->firstm = cm;
 
-    sl = strlen(str);
     lpl = (cm->ppre ? strlen(cm->ppre) : 0);
     lsl = (cm->psuf ? strlen(cm->psuf) : 0);
-    ml = sl + lpl + lsl;
+    ml = stl + lpl + lsl;
 
     if (ml < minmlen)
 	minmlen = ml;
@@ -2566,7 +2623,7 @@ add_match_data(int alt, char *str, char 
 		    e += lpl;
 		}
 		strcpy(e, str);
-		e += sl;
+		e += stl;
 		if (cm->psuf)
 		    strcpy(e, cm->psuf);
 		comp_setunset(0, 0, CP_EXACTSTR, 0);

* Re: Another idea on how to insert illegal multibyte characters
  2006-02-11 10:19   ` Wayne Davison
@ 2006-02-13 10:54     ` Peter Stephenson
  0 siblings, 0 replies; 5+ messages in thread
From: Peter Stephenson @ 2006-02-13 10:54 UTC (permalink / raw)
  To: Zsh hackers list

Wayne Davison wrote:
> On Thu, Jan 12, 2006 at 09:23:19AM +0000, Peter Stephenson wrote:
> > The completion system is a bit more quoting aware: it knows whether or
> > not it needs to insert a backslash before special characters because of
> > quotes earlier on the line.  Ideally it should handle unprintable
> > characters at the same point where it tries to do that.  That doesn't
> > need to be done at the same time, though.  (I would hope it could be
> > done independently and prevent the equivalent code inside zle kicking
> > in.)
> 
> The attached patch is an alternative to my older patch that changed
> stringaszleline().  This one changes add_match_data(), which means that
> it is happening early enough that zsh could be made to figure out how
> to insert the $'\123' sequences into single- or double-quoted strings
> (though it does not yet do this).  This patch also fixes the updating
> glitch that I mentioned my last patch had.

Any reasonably consistent attempt to do this, even if it's incomplete,
strikes me as a good thing.  4.3.1 isn't going to be completely
multibyte-aware anyway.

-- 
Peter Stephenson <pws@csr.com>                  Software Engineer
CSR PLC, Churchill House, Cambridge Business Park, Cowley Road
Cambridge, CB4 0WZ, UK                          Tel: +44 (0)1223 692070

