On Wed, Oct 18, 2006 at 06:20:19PM +0100, Peter Stephenson wrote:
> Alexey Tourbin <at@altlinux.ru> wrote:
> > Thanks for the clue.  git-bisect now blames 22544.
> 
> That patch made the shell smarter about finding the end of
> special types of string known to the shell (identifiers in particular),
> using the multibyte code.
> 
> I wonder if it's part of the problem Andrey noted?  At some points the
> string we apply this to may contain tokenized characters, which
> aren't valid multibyte characters.  Since the string must be metafied,
> these are easy to detect.
> 
> The simplest fix is just to ensure we don't try to handle these as
> multibyte characters, telling the caller they're invalid.  Most callers
> will just handle it as a single-byte character and move on, which
> is the right thing to do; some callers which really need valid characters
> will abort, but they shouldn't be getting a tokenized string.  So
> this might actually work.  If not, we need to be smarter, but probably at a
> higher level.
> 
> We need some fix like this even if it isn't the root of the present
> problem.  (If I could reproduce that it ought now to be easy to trace.)
> 
> Index: Src/utils.c
> ===================================================================
> RCS file: /cvsroot/zsh/zsh/Src/utils.c,v
> retrieving revision 1.142
> diff -u -r1.142 utils.c
> --- Src/utils.c	10 Oct 2006 09:37:19 -0000	1.142
> +++ Src/utils.c	18 Oct 2006 17:09:16 -0000
> @@ -4003,6 +4003,21 @@
>  	    *wcp = (wint_t)(*s == Meta ? s[1] ^ 32 : *s);
>  	return 1 + (*s == Meta);
>      }
> +    /*
> +     * We have to handle tokens here, since we may be looking
> +     * through a tokenized input.  Obviously this isn't
> +     * a valid multibyte character, so just return WEOF
> +     * and let the caller handle it as a single character.
> +     *
> +     * TODO: I've a sneaking suspicion we could do more here
> +     * to prevent the caller always needing to handle invalid
> +     * characters specially, but sometimes it may need to know.
> +     */
> +    if (itok(*s)) {
> +	if (wcp)
> +	    *wcp = WEOF;
> +	return 1;
> +    }
>  
>      ret = MB_INVALID;
>      for (ptr = s; *ptr; ) {

Thanks Peter!  This patch resolves the problem.
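
For anyone reading along, here is how I understand the idea, as a
standalone sketch rather than the real zsh code.  The names is_token(),
next_char() and count_chars() and the token byte range are my own
assumptions for illustration; zsh's itok() and its token values are
defined elsewhere, and real multibyte decoding is left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Assumed token byte range, for illustration only. */
#define TOK_FIRST 0x84
#define TOK_LAST  0xa1

/* Hypothetical stand-in for zsh's itok(). */
static int is_token(unsigned char c)
{
    return c >= TOK_FIRST && c <= TOK_LAST;
}

/*
 * Simplified model of the patched function: a token byte is reported
 * as an invalid character (WEOF) that consumes exactly one byte.
 * Every other byte is passed through unchanged; no multibyte
 * decoding is attempted in this sketch.
 */
static int next_char(const char *s, wint_t *wcp)
{
    unsigned char c;

    assert(s != NULL);
    c = (unsigned char)*s;
    if (is_token(c)) {
        if (wcp)
            *wcp = WEOF;
        return 1;
    }
    if (wcp)
        *wcp = (wint_t)c;
    return 1;
}

/*
 * A caller that treats an invalid character as a single byte and
 * moves on.  Returns the number of characters seen and stores the
 * number of invalid ones in *ninvalid.
 */
static size_t count_chars(const char *s, size_t *ninvalid)
{
    size_t n = 0;

    *ninvalid = 0;
    while (*s) {
        wint_t wc;
        int len = next_char(s, &wc);

        if (wc == WEOF)
            ++*ninvalid;
        s += len;
        n++;
    }
    return n;
}
```

With this model, a string holding 'a', one byte from the assumed token
range, and 'b' is walked as three characters, one of them invalid, and
the walk never stalls on the token byte.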

(I quote the whole message because apparently it was not CC'ed to
zsh-workers.)

Unfortunately I don't quite understand the Unicode issues in zsh.  I build
the zsh rpm package because I use it (and a few others use it, too).  The
latest stable 4.2 release had problems in a utf8 console, so I decided
to move to the then-current cvs snapshot.  I got my first decently working
utf8-enabled zsh with the 20050926 snapshot.

So for now, feedback is about the only thing I can provide.
This will change as I grok the zsh code.

BTW, a git archive is available at
git://git.altlinux.org/people/at/packages/zsh.git
The 'master' branch is for my own cooking, but the "cvs",
"zsh-4_0-patches" and "zsh-4_2-patches" branches have pristine zsh sources.
I verified the "cvs" branch against a checkout, and it's almost zero-diff
(the only exception is a very old Completion/Core/_closequotes, which
is in the branch but not in the checkout).  I used Keith Packard's "parsecvs"
(with my changes, some of which are already merged into mainline).

> -- 
> Peter Stephenson <pws@csr.com>                  Software Engineer
> CSR PLC, Churchill House, Cambridge Business Park, Cowley Road
> Cambridge, CB4 0WZ, UK                          Tel: +44 (0)1223 692070