zsh-workers
* mb_metacharlenconv vs. tokens
@ 2006-09-25 17:45 Andrey Borzenkov
  2006-09-26  9:20 ` Peter Stephenson
  0 siblings, 1 reply; 6+ messages in thread
From: Andrey Borzenkov @ 2006-09-25 17:45 UTC
  To: zsh-workers

Apparently mb_metacharlenconv gets passed a tokenized string (what is the
proper term for this in zsh lingo?). This means that length(any-token-char) == 1
is actually a side effect of mbrtowc failing miserably and mb_metacharlenconv
returning the fallback value 1 (at least when using UTF-8).

Shouldn't it untokenize the character first? If yes, I will provide a fix as
part of a larger patch; if no, I fail to see how it works.
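
For illustration, a minimal sketch of the fallback behaviour described
above. charlen_fallback is a hypothetical stand-in, not the zsh function:
it calls mbrtowc() and reports a length of 1 when the conversion fails,
which is how a token byte such as \210 can end up counted as one character
by accident rather than by design.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical stand-in for the fallback path: length in bytes of the
 * next character in s, or 1 if mbrtowc() cannot convert it. */
static size_t charlen_fallback(const char *s, wchar_t *wcp)
{
    mbstate_t st;
    wchar_t wc;
    size_t ret;

    assert(s != NULL && *s != '\0');
    memset(&st, 0, sizeof st);
    ret = mbrtowc(&wc, s, strlen(s), &st);
    if (ret == (size_t)-1 || ret == (size_t)-2) {
        /* Invalid or incomplete sequence: treat the byte as one character. */
        if (wcp)
            *wcp = (wchar_t)(unsigned char)*s;
        return 1;
    }
    if (wcp)
        *wcp = wc;
    return ret;
}
```

In a UTF-8 locale the byte \210 (0x88) is a continuation byte and cannot
start a character, so a call such as charlen_fallback("\210M", NULL) would
take the error branch and still return 1.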

* Re: mb_metacharlenconv vs. tokens
  2006-09-25 17:45 mb_metacharlenconv vs. tokens Andrey Borzenkov
@ 2006-09-26  9:20 ` Peter Stephenson
  2006-09-26 18:03   ` Andrey Borzenkov
  0 siblings, 1 reply; 6+ messages in thread
From: Peter Stephenson @ 2006-09-26  9:20 UTC
  To: zsh-workers

Andrey Borzenkov wrote:
> Apparently mb_metacharlenconv gets passed a tokenized string (what is the
> proper term for this in zsh lingo?). This means that length(any-token-char) == 1
> is actually a side effect of mbrtowc failing miserably and mb_metacharlenconv
> returning the fallback value 1 (at least when using UTF-8).
> 
> Shouldn't it untokenize the character first? If yes, I will provide a fix as
> part of a larger patch; if no, I fail to see how it works.

It's a bug if it's getting a tokenized character.  It should have been
untokenized at some point in the sequence leading to the call, since
mb_metacharlenconv() should already be dealing with a printable
string---by implication, we're looking at its width or length in
characters because we're about to do something to it which is
appropriate to a printable string.  However, this being zsh, the exact
point at which untokenization should be done is not necessarily obvious.
This is particularly true if we're using the string as a pattern.

I had a case like this in substitution recently:

2006-09-12  Peter Stephenson  <pws@csr.com>

	* 22689: Src/subst.c, Test/D04parameter.ztst: untokenize
	strings for substitution in cases like
	${${~:-*}//(#m)*/$MATCH=$MATCH}.  The pattern code tried
	to metafy the tokens, which caused chaos.

We're a victim of the fact that metafied strings are used both to
protect tokens and to protect embedded NULs; it tends to hide the logic
indicating that tokens are still around when they shouldn't be.
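
To make the double role of metafication concrete, here is a rough sketch of
the encoding, not the zsh implementation. It assumes the usual zsh
convention of a Meta byte (0x83) followed by the original byte XOR 32; the
set of bytes treated as special (NUL and 0x83..0xa2) is an assumption made
for the example, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define META 0x83u  /* assumed value of zsh's Meta byte */

/* Assumption: NUL and bytes 0x83..0xa2 need protecting; the real set
 * is defined by zsh's character type table. */
static int needs_meta(unsigned char c)
{
    return c == 0 || (c >= 0x83 && c <= 0xa2);
}

/* Encode len bytes of in into out (capacity at least 2*len + 1).
 * Returns the encoded length; the result is NUL-terminated. */
static size_t metafy_sketch(const unsigned char *in, size_t len,
                            unsigned char *out)
{
    size_t o = 0;

    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < len; i++) {
        if (needs_meta(in[i])) {
            out[o++] = META;
            out[o++] = (unsigned char)(in[i] ^ 32);
        } else {
            out[o++] = in[i];
        }
    }
    out[o] = '\0';
    return o;
}

/* Decode a metafied, NUL-terminated string back into raw bytes.
 * Returns the number of raw bytes written to out. */
static size_t unmetafy_sketch(const unsigned char *in, unsigned char *out)
{
    size_t o = 0;

    while (*in) {
        if (*in == META && in[1] != '\0') {
            out[o++] = (unsigned char)(in[1] ^ 32);
            in += 2;
        } else {
            out[o++] = *in++;
        }
    }
    return o;
}
```

Under these assumptions a literal byte 0x88 and an embedded NUL are both
rewritten as two-byte Meta sequences, so a live token that is metafied by
mistake stops looking like a token, which is one way the chaos described
above can arise.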

-- 
Peter Stephenson <pws@csr.com>                  Software Engineer
CSR PLC, Churchill House, Cambridge Business Park, Cowley Road
Cambridge, CB4 0WZ, UK                          Tel: +44 (0)1223 692070


To access the latest news from CSR copy this link into a web browser:  http://www.csr.com/email_sig.php



* Re: mb_metacharlenconv vs. tokens
  2006-09-26  9:20 ` Peter Stephenson
@ 2006-09-26 18:03   ` Andrey Borzenkov
  2006-09-26 18:10     ` Peter Stephenson
  0 siblings, 1 reply; 6+ messages in thread
From: Andrey Borzenkov @ 2006-09-26 18:03 UTC
  To: zsh-workers

On Tuesday 26 September 2006 13:20, Peter Stephenson wrote:
> Andrey Borzenkov wrote:
> > Apparently mb_metacharlenconv gets passed a tokenized string (what is the
> > proper term for this in zsh lingo?). This means that
> > length(any-token-char) == 1 is actually a side effect of mbrtowc failing
> > miserably and mb_metacharlenconv returning the fallback value 1 (at least
> > when using UTF-8).
> >
> > Shouldn't it untokenize the character first? If yes, I will provide a fix as
> > part of a larger patch; if no, I fail to see how it works.
>
> It's a bug if it's getting a tokenized character. 

Then this is a very basic bug, because something as simple as running a
function from the V01 test results in:

Breakpoint 2, mb_metacharlenconv (
    s=0xb7c41951 "\215\210M\211\205\215\210f\211\231\212\210zmodload -d\211\231\216:\204\207:\207 \205m\210 \207\214\211\216", wcp=0xbfd4ef88)

which corresponds to the line in zmodunload:

if [[ -z ${(M)${(f)"$(zmodload -d)"}:#*:* $m( *|)} ]]
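
As a reading aid, the octal bytes in the dump can be lined up with the
source line to recover which token stands for which character. The sketch
below does that with a small table; the pairs are inferred purely from
this dump and this source line, not copied from zsh's headers, and the
function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Token-to-ASCII pairs inferred by lining the dump up with the source
 * line; not copied from zsh's headers. */
static char untoken_byte(unsigned char c)
{
    switch (c) {
    case 0x84: return '#';
    case 0x85: return '$';
    case 0x87: return '*';
    case 0x88: return '(';
    case 0x89: return ')';
    case 0x8a: return '$';   /* '$' seen inside the double quotes */
    case 0x8c: return '|';
    case 0x8d: return '{';
    case 0x8e: return '}';
    case 0x99: return '"';
    default:   return (char)c;
    }
}

/* Copy in to out, replacing each known token byte by its ASCII form.
 * out must have room for strlen(in) + 1 bytes. */
static void untokenize_sketch(const char *in, char *out)
{
    size_t n;

    assert(in != NULL && out != NULL);
    n = strlen(in);
    for (size_t i = 0; i < n; i++)
        out[i] = untoken_byte((unsigned char)in[i]);
    out[n] = '\0';
}
```

Applied to the string in the dump, this yields the text of the
substitution starting just after its leading `$`.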


> It should have been 
> untokenized at some point in the sequence leading to the call, 

#0  mb_metacharlenconv (
    s=0xb7c41951 "\215\210M\211\205\215\210f\211\231\212\210zmodload -d\211\231\216:\204\207:\207 \205m\210 \207\214\211\216", wcp=0xbfd4ef88)
    at /home/bor/src/zsh/Src/utils.c:3999
#1  0x080cab82 in itype_end (
    ptr=0xb7c41951 "\215\210M\211\205\215\210f\211\231\212\210zmodload -d\211\231\216:\204\207:\207 \205m\210 \207\214\211\216", itype=128, once=1)
    at /home/bor/src/zsh/Src/utils.c:3064
#2  0x080bd2a4 in paramsubst (l=0xbfd4f7ac, n=0xbfd4f7a0, str=0xbfd4f388,
    qt=0, ssub=4) at /home/bor/src/zsh/Src/subst.c:1499
#3  0x080ba738 in stringsubst (list=0xbfd4f7ac, node=0xbfd4f7a0, ssub=4,
    asssub=0) at /home/bor/src/zsh/Src/subst.c:156
#4  0x080ba089 in prefork (list=0xbfd4f7ac, flags=4)
    at /home/bor/src/zsh/Src/subst.c:91
#5  0x080bacbb in singsub (s=0xbfd4f8ac) at /home/bor/src/zsh/Src/subst.c:308
#6  0x0806633b in evalcond (state=0xbfd5022c, fromtest=0x0)
    at /home/bor/src/zsh/Src/cond.c:151
#7  0x0806fe61 in execcond (state=0xbfd5022c, do_exec=0)
    at /home/bor/src/zsh/Src/exec.c:3423
#8  0x08068d91 in execsimple (state=0xbfd5022c)
    at /home/bor/src/zsh/Src/exec.c:827
#9  0x08068e6c in execlist (state=0xbfd5022c, dont_change_job=1, exiting=0)
    at /home/bor/src/zsh/Src/exec.c:873
#10 0x080909e6 in execif (state=0xbfd5022c, do_exec=0)
    at /home/bor/src/zsh/Src/loop.c:505
#11 0x0806dcfe in execcmd (state=0xbfd5022c, input=0, output=0, how=18,
    last1=2) at /home/bor/src/zsh/Src/exec.c:2535
#12 0x0806a1dc in execpline2 (state=0xbfd5022c, pcode=387, how=18, input=0,
    output=0, last1=0) at /home/bor/src/zsh/Src/exec.c:1301
#13 0x0806967e in execpline (state=0xbfd5022c, slcode=38914, how=18, last1=0)
    at /home/bor/src/zsh/Src/exec.c:1087
#14 0x08068f51 in execlist (state=0xbfd5022c, dont_change_job=1, exiting=0)
    at /home/bor/src/zsh/Src/exec.c:893
#15 0x0808fc1b in execfor (state=0xbfd5022c, do_exec=0)
    at /home/bor/src/zsh/Src/loop.c:159
#16 0x0806dcfe in execcmd (state=0xbfd5022c, input=0, output=0, how=2,
    last1=2) at /home/bor/src/zsh/Src/exec.c:2535
#17 0x0806a1dc in execpline2 (state=0xbfd5022c, pcode=259, how=2, input=0,
    output=0, last1=0) at /home/bor/src/zsh/Src/exec.c:1301
#18 0x0806967e in execpline (state=0xbfd5022c, slcode=45058, how=2, last1=0)
    at /home/bor/src/zsh/Src/exec.c:1087
#19 0x08068f51 in execlist (state=0xbfd5022c, dont_change_job=1, exiting=0)
    at /home/bor/src/zsh/Src/exec.c:893
#20 0x08068c56 in execode (p=0x8102e38, dont_change_job=1, exiting=0)
    at /home/bor/src/zsh/Src/exec.c:793
#21 0x08070fb2 in runshfunc (prog=0x8102e38, wrap=0x0,
    name=0xb7c414f0 "zmodunload") at /home/bor/src/zsh/Src/exec.c:3915
#22 0xb7bd82ac in ?? ()
#23 0x08102e38 in ?? ()
#24 0x00000000 in ?? ()



* Re: mb_metacharlenconv vs. tokens
  2006-09-26 18:03   ` Andrey Borzenkov
@ 2006-09-26 18:10     ` Peter Stephenson
  2006-09-27 16:31       ` Andrey Borzenkov
  0 siblings, 1 reply; 6+ messages in thread
From: Peter Stephenson @ 2006-09-26 18:10 UTC
  To: zsh-workers

Andrey Borzenkov wrote:
> Then this is a very basic bug, because something as simple as running a
> function from the V01 test results in:
> 
> Breakpoint 2, mb_metacharlenconv (
>     s=0xb7c41951 "\215\210M\211\205\215\210f\211\231\212\210zmodload -d\211\231\216:\204\207:\207 \205m\210 \207\214\211\216", wcp=0xbfd4ef88)
> 
> which corresponds to the line in zmodunload:
> 
> if [[ -z ${(M)${(f)"$(zmodload -d)"}:#*:* $m( *|)} ]]

> > It should have been 
> > untokenized at some point in the sequence leading to the call, 
> 
> #2  0x080bd2a4 in paramsubst (l=0xbfd4f7ac, n=0xbfd4f7a0, str=0xbfd4f388,
>     qt=0, ssub=4) at /home/bor/src/zsh/Src/subst.c:1499

The problem is probably here (or around here... there's been some
recursive jiggery pokery).  We should untokenize a nested
substitution before trying to do anything with it, and only tokenize it
later if the effect of GLOB_SUBST is present.  This is roughly where I
saw the problem before.


* Re: mb_metacharlenconv vs. tokens
  2006-09-26 18:10     ` Peter Stephenson
@ 2006-09-27 16:31       ` Andrey Borzenkov
  2006-09-27 16:51         ` Peter Stephenson
  0 siblings, 1 reply; 6+ messages in thread
From: Andrey Borzenkov @ 2006-09-27 16:31 UTC
  To: zsh-workers

On Tuesday 26 September 2006 22:10, Peter Stephenson wrote:
> Andrey Borzenkov wrote:
> > Then this is a very basic bug, because something as simple as running a
> > function from the V01 test results in:
> >
> > Breakpoint 2, mb_metacharlenconv (
> >     s=0xb7c41951 "\215\210M\211\205\215\210f\211\231\212\210zmodload -d\211\231\216:\204\207:\207 \205m\210 \207\214\211\216", wcp=0xbfd4ef88)
> >
> > which corresponds to the line in zmodunload:
> >
> > if [[ -z ${(M)${(f)"$(zmodload -d)"}:#*:* $m( *|)} ]]
> >
> > > It should have been
> > > untokenized at some point in the sequence leading to the call,
> >
> > #2  0x080bd2a4 in paramsubst (l=0xbfd4f7ac, n=0xbfd4f7a0, str=0xbfd4f388,
> >     qt=0, ssub=4) at /home/bor/src/zsh/Src/subst.c:1499
>
> The problem is probably here (or around here... there's been some
> recursive jiggery pokery). 

This one seems to be pretty much top level.

> We should untokenize a nested 
> substitution before trying to do anything with it, and only tokenize it
> later if the effect of GLOB_SUBST is present.  This is roughly where I
> saw the problem before.

I am not sure if this is really possible (or feasible). While the paramsubst
case is trivially solved (worked around, actually) by

Index: Src/subst.c
===================================================================
RCS file: /cvsroot/zsh/zsh/Src/subst.c,v
retrieving revision 1.63
diff -u -p -r1.63 subst.c
--- Src/subst.c 23 Sep 2006 20:25:06 -0000      1.63
+++ Src/subst.c 27 Sep 2006 16:23:41 -0000
@@ -1496,12 +1496,13 @@ paramsubst(LinkList l, LinkNode n, char
      * these later on, too.
      */
     c = *s;
-    if (itype_end(s, IIDENT, 1) == s && *s != '#' && c != Pound &&
+    if (*s != '#' && c != Pound &&
        c != '-' && c != '!' && c != '$' && c != String && c != Qstring &&
        c != '?' && c != Quest &&
        c != '*' && c != Star && c != '@' && c != '{' &&
        c != Inbrace && c != '=' && c != Equals && c != Hat &&
-       c != '^' && c != '~' && c != Tilde && c != '+') {
+       c != '^' && c != '~' && c != Tilde && c != '+' &&
+       (itok(c) || itype_end(s, IIDENT, 1) == s)) {
        s[-1] = '$';
        *str = s;
        return n;


the fetchvalue() case is not; and in fetchvalue() we explicitly look for a
tokenized string.
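
The effect of the workaround in the patch above comes from evaluation
order: the cheap comparisons and the token test run first, so the
identifier scan is only reached for a byte that is not a token. A small
sketch of that short-circuit, using hypothetical stand-ins (is_token for
itok(), ident_end_once for itype_end(s, IIDENT, 1)) and only a subset of
the excluded characters; the token range and the value 0x8d for a
tokenized '{' are assumptions taken from the dump earlier in the thread.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

static int scan_calls;   /* counts calls to the stand-in scanner */

/* Hypothetical stand-in for itok(): assumes tokens occupy 0x84..0xa2. */
static int is_token(unsigned char c)
{
    return c >= 0x84 && c <= 0xa2;
}

/* Hypothetical stand-in for itype_end(s, IIDENT, 1): returns a pointer
 * just past one identifier character, or s itself if there is none. */
static const char *ident_end_once(const char *s)
{
    unsigned char c = (unsigned char)*s;

    scan_calls++;
    return (isalnum(c) || c == '_') ? s + 1 : s;
}

/* Mirrors the shape of the patched condition with a reduced exclusion
 * list: the scanner runs only if every earlier test passed and the
 * first byte is not a token. */
static int not_a_parameter_start(const char *s)
{
    unsigned char c;

    assert(s != NULL);
    c = (unsigned char)*s;
    return c != '{' && c != 0x8d /* tokenized '{' per the dump */ &&
           (is_token(c) || ident_end_once(s) == s);
}
```

With the string from the backtrace, which begins with \215, the function
returns before the scanner is called at all, which is the behaviour the
workaround relies on.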

I am afraid that mb_metacharlenconv is rather overloaded. It looks like
traversing a string character by character is a valid operation for input as
well, so we cannot exclude tokens there.

Let's put it differently: what we intend is to avoid passing a bogus character
to mbrtowc(). If we *know* the context is tokenized, we could just as well
pass a flag to itype_end() and mb_metacharlenconv() so that they check for
tokens and skip them. Does that actually make sense?
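
A rough sketch of what such a flag could look like, under stated
assumptions: count_chars and is_token_byte are hypothetical names, the
token range 0x84..0xa2 is assumed for the example, and the real functions
have different signatures. With the flag set, token bytes are counted as
one character each and never reach mbrtowc().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical token test: assumes token bytes occupy 0x84..0xa2. */
static int is_token_byte(unsigned char c)
{
    return c >= 0x84 && c <= 0xa2;
}

/* Count characters in s.  When tokens_ok is nonzero, token bytes are
 * counted as one character each without being shown to mbrtowc().
 * *failures receives the number of failed conversions. */
static size_t count_chars(const char *s, int tokens_ok, size_t *failures)
{
    mbstate_t st;
    size_t n = 0, left;

    assert(s != NULL && failures != NULL);
    left = strlen(s);
    *failures = 0;
    memset(&st, 0, sizeof st);
    while (left > 0) {
        size_t len;

        if (tokens_ok && is_token_byte((unsigned char)*s)) {
            len = 1;                       /* known token: skip conversion */
        } else {
            wchar_t wc;

            len = mbrtowc(&wc, s, left, &st);
            if (len == (size_t)-1 || len == (size_t)-2) {
                (*failures)++;
                memset(&st, 0, sizeof st); /* reset state after an error */
                len = 1;                   /* fallback length of one byte */
            }
        }
        s += len;
        left -= len;
        n++;
    }
    return n;
}
```

The character count is the same with or without the flag; the difference
is that with the flag set the count no longer depends on mbrtowc()
rejecting the token bytes.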


* Re: mb_metacharlenconv vs. tokens
  2006-09-27 16:31       ` Andrey Borzenkov
@ 2006-09-27 16:51         ` Peter Stephenson
  0 siblings, 0 replies; 6+ messages in thread
From: Peter Stephenson @ 2006-09-27 16:51 UTC
  To: zsh-workers

Andrey Borzenkov wrote:
> I am afraid that mb_metacharlenconv is rather overloaded. It looks like
> traversing a string character by character is a valid operation for input as
> well, so we cannot exclude tokens there.
> 
> Let's put it differently: what we intend is to avoid passing a bogus character
> to mbrtowc(). If we *know* the context is tokenized, we could just as well
> pass a flag to itype_end() and mb_metacharlenconv() so that they check for
> tokens and skip them. Does that actually make sense?

Yes, that makes perfect sense.  It's then a case of deciding on the
context, but that's a lot less heavyweight than untokenizing.

