On Wed, Oct 18, 2006 at 06:20:19PM +0100, Peter Stephenson wrote:
> Alexey Tourbin wrote:
> > Thanks for the clue. git-bisect now blames 22544.
>
> That patch made the shell smarter about finding the end of
> special types of string known to the shell (identifiers in particular),
> using the multibyte code.
>
> I wonder if it's part of the problem Andrey noted? At some points the
> string we apply this to may contain tokenized characters, which
> aren't valid multibyte characters. Since the string must be metafied,
> these are easy to detect.
>
> The simplest fix is just to ensure we don't try to handle these as
> multibyte characters, telling the caller they're invalid. Most callers
> will just handle it as a single-byte character and move on, which
> is the right thing to do; some callers which really need valid characters
> will abort, but they shouldn't be getting a tokenized string. So
> this might actually work. If not, we need to be smarter, but probably at a
> higher level.
>
> We need some fix like this even if it isn't the root of the present
> problem. (If I could reproduce that it ought now to be easy to trace.)
>
> Index: Src/utils.c
> ===================================================================
> RCS file: /cvsroot/zsh/zsh/Src/utils.c,v
> retrieving revision 1.142
> diff -u -r1.142 utils.c
> --- Src/utils.c 10 Oct 2006 09:37:19 -0000 1.142
> +++ Src/utils.c 18 Oct 2006 17:09:16 -0000
> @@ -4003,6 +4003,21 @@
>          *wcp = (wint_t)(*s == Meta ? s[1] ^ 32 : *s);
>      return 1 + (*s == Meta);
>      }
> +    /*
> +     * We have to handle tokens here, since we may be looking
> +     * through a tokenized input. Obviously this isn't
> +     * a valid multibyte character, so just return WEOF
> +     * and let the caller handle it as a single character.
> +     *
> +     * TODO: I've a sneaking suspicion we could do more here
> +     * to prevent the caller always needing to handle invalid
> +     * characters specially, but sometimes it may need to know.
> +     */
> +    if (itok(*s)) {
> +        if (wcp)
> +            *wcp = EOF;
> +        return 1;
> +    }
>
>      ret = MB_INVALID;
>      for (ptr = s; *ptr; ) {

Thanks Peter! This patch resolves the problem.
(I quote the whole message because apparently it was not CC'ed to
zsh-workers.)

Unfortunately I don't quite understand unicode issues in zsh. I build
the zsh rpm package because I use it (and a few others use it, too). The
latest stable 4.2 release had problems in a utf8 console, so I decided to
move to the then-current cvs snapshot. I got my first decently working
utf8-enabled zsh with the 20050926 snapshot.

So for now, about the only thing I can provide is feedback.
This will change as I grok the zsh code.

BTW, a git archive is available at
git://git.altlinux.org/people/at/packages/zsh.git

The 'master' branch is for my own cooking, but the "cvs" branch, as well as
"zsh-4_0-patches" and "zsh-4_2-patches", have pristine zsh sources.
I verified the "cvs" branch against a checkout, and it's almost zero-diff
(the only exception is that a very old Completion/Core/_closequotes
is in there, but is not in the checkout). I used Keith Packard's "parsecvs"
(with my changes, some of which are already merged into mainline).

> --
> Peter Stephenson Software Engineer
> CSR PLC, Churchill House, Cambridge Business Park, Cowley Road
> Cambridge, CB4 0WZ, UK Tel: +44 (0)1223 692070
>
>
> To access the latest news from CSR copy this link into a web browser: http://www.csr.com/email_sig.php
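
P.S. For anyone following the thread, here is a rough, standalone sketch of the idea in the quoted patch as I read it: a token byte is reported as an invalid character (WEOF) and consumed as a single byte, while a Meta-escaped pair is consumed as two bytes. This is not zsh code. The names sketch_itok and sketch_next_char, and the byte values for Meta and the token range, are assumptions made up for illustration; real zsh uses its own typtab-based itok() and does proper multibyte decoding, which this sketch leaves out.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical stand-ins for illustration only, NOT zsh's definitions. */
#define SKETCH_META      0x83u  /* assumed escape byte for metafied strings */
#define SKETCH_TOK_FIRST 0x84u  /* assumed first token byte */
#define SKETCH_TOK_LAST  0xa2u  /* assumed last token byte */

/* Nonzero if c falls in the assumed token range. */
static int sketch_itok(unsigned char c)
{
    return c >= SKETCH_TOK_FIRST && c <= SKETCH_TOK_LAST;
}

/*
 * Examine the character at s in a metafied, possibly tokenized string.
 * Returns the number of bytes consumed.  If wcp is non-NULL, stores the
 * decoded character, or WEOF when the byte is a token and so cannot be
 * part of a valid character.  Only the single-byte case is modelled.
 */
static size_t sketch_next_char(const unsigned char *s, wint_t *wcp)
{
    if (*s == SKETCH_META && s[1] != '\0') {
        /* Meta-escaped byte: the real value is the next byte XOR 32. */
        if (wcp)
            *wcp = (wint_t)(s[1] ^ 32);
        return 2;
    }
    if (sketch_itok(*s)) {
        /* Token: tell the caller it is invalid, but still advance. */
        if (wcp)
            *wcp = WEOF;
        return 1;
    }
    if (wcp)
        *wcp = (wint_t)*s;
    return 1;
}
```

A caller that only needs to skip over characters can ignore the WEOF and advance by the returned length; a caller that needs a valid character can treat WEOF as an error.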