* mb_metacharlenconv vs. tokens
From: Andrey Borzenkov @ 2006-09-25 17:45 UTC
To: zsh-workers

Apparently mb_metacharlenconv gets passed a tokenized string (what is the
proper term in zsh lingo?). This means that length(any-token-char) == 1 is
actually a side effect of mbrtowc failing miserably and mb_metacharlenconv
returning its fallback of 1 (at least when using UTF-8).

Shouldn't it untokenize the character first? If yes, I will provide a fix as
part of a larger patch; if no, I fail to see how it works.

* Re: mb_metacharlenconv vs. tokens
From: Peter Stephenson @ 2006-09-26 9:20 UTC
To: zsh-workers

Andrey Borzenkov wrote:
> Apparently mb_metacharlenconv gets passed a tokenized string (what is the
> proper term in zsh lingo?). This means that length(any-token-char) == 1 is
> actually a side effect of mbrtowc failing miserably and mb_metacharlenconv
> returning its fallback of 1 (at least when using UTF-8).
>
> Shouldn't it untokenize the character first? If yes, I will provide a fix as
> part of a larger patch; if no, I fail to see how it works.

It's a bug if it's getting a tokenized character. It should have been
untokenized at some point in the sequence leading to the call, since
mb_metacharlenconv() should already be dealing with a printable string: by
implication, we're looking at its width or length in characters because
we're about to do something to it that is appropriate to a printable string.

However, this being zsh, the exact point at which untokenization should be
done is not necessarily obvious. This is particularly true if we're using
the string as a pattern. I had a case like this in substitution recently:

2006-09-12  Peter Stephenson  <pws@csr.com>

        * 22689: Src/subst.c, Test/D04parameter.ztst: untokenize
        strings for substitution in cases like
        ${${~:-*}//(#m)*/$MATCH=$MATCH}.

The pattern code tried to metafy the tokens, which caused chaos.

We're a victim of the fact that metafied strings are used both to protect
tokens and to protect embedded NULs; it tends to hide the logic indicating
that tokens are still around when they shouldn't be.

-- 
Peter Stephenson <pws@csr.com>                  Software Engineer
CSR PLC, Churchill House, Cambridge Business Park, Cowley Road
Cambridge, CB4 0WZ, UK                          Tel: +44 (0)1223 692070

To access the latest news from CSR copy this link into a web browser:
http://www.csr.com/email_sig.php

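To make the fallback under discussion concrete, here is a minimal C sketch
of a character-length helper that returns 1 whenever mbrtowc() cannot decode
the next character. It is not the zsh implementation: the name
charlen_with_fallback is hypothetical, metafication is ignored entirely, and
decoding happens in whatever locale the program is running in (call
setlocale(LC_ALL, "") first to observe the UTF-8 case). The byte \207 is
taken from the gdb dump elsewhere in this thread, where it sits in the
position of the pattern's "*".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper, NOT zsh's mb_metacharlenconv().
 * Returns the number of bytes occupied by the first character of s,
 * or 0 for an empty string.  If mbrtowc() reports an invalid or
 * incomplete sequence, the first byte is counted as one character.
 * If wcp is non-NULL it receives the decoded character, or WEOF when
 * the fallback was taken.
 */
size_t charlen_with_fallback(const char *s, wint_t *wcp)
{
    mbstate_t st;
    wchar_t wc;
    size_t ret;

    assert(s != NULL);
    if (*s == '\0')
        return 0;

    /* Start from the initial conversion state on every call. */
    memset(&st, 0, sizeof st);
    ret = mbrtowc(&wc, s, strlen(s), &st);

    if (ret == (size_t)-1 || ret == (size_t)-2 || ret == 0) {
        /* Undecodable input: fall back to "one byte, one character". */
        if (wcp)
            *wcp = WEOF;
        return 1;
    }

    if (wcp)
        *wcp = (wint_t)wc;
    return ret;
}
```

In UTF-8 a byte in the range 0x80 to 0xBF cannot start a character, so a
call such as charlen_with_fallback("\207abc", NULL) yields 1 only by way of
the error path. That is the side effect the original report describes.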
* Re: mb_metacharlenconv vs. tokens
From: Andrey Borzenkov @ 2006-09-26 18:03 UTC
To: zsh-workers

On Tuesday 26 September 2006 13:20, Peter Stephenson wrote:
> Andrey Borzenkov wrote:
> > Apparently mb_metacharlenconv gets passed a tokenized string (what is
> > the proper term in zsh lingo?). This means that
> > length(any-token-char) == 1 is actually a side effect of mbrtowc
> > failing miserably and mb_metacharlenconv returning its fallback of 1
> > (at least when using UTF-8).
> >
> > Shouldn't it untokenize the character first? If yes, I will provide a
> > fix as part of a larger patch; if no, I fail to see how it works.
>
> It's a bug if it's getting a tokenized character.

Then this is a very basic bug, because something as simple as running a
function from the V01 test results in:

Breakpoint 2, mb_metacharlenconv (
    s=0xb7c41951 "\215\210M\211\205\215\210f\211\231\212\210zmodload -d\211\231\216:\204\207:\207 \205m\210 \207\214\211\216",
    wcp=0xbfd4ef88)

which corresponds to the line in zmodunload:

    if [[ -z ${(M)${(f)"$(zmodload -d)"}:#*:* $m( *|)} ]]

> It should have been
> untokenized at some point in the sequence leading to the call,

#0  mb_metacharlenconv (
    s=0xb7c41951 "\215\210M\211\205\215\210f\211\231\212\210zmodload -d\211\231\216:\204\207:\207 \205m\210 \207\214\211\216",
    wcp=0xbfd4ef88) at /home/bor/src/zsh/Src/utils.c:3999
#1  0x080cab82 in itype_end (
    ptr=0xb7c41951 "\215\210M\211\205\215\210f\211\231\212\210zmodload -d\211\231\216:\204\207:\207 \205m\210 \207\214\211\216",
    itype=128, once=1) at /home/bor/src/zsh/Src/utils.c:3064
#2  0x080bd2a4 in paramsubst (l=0xbfd4f7ac, n=0xbfd4f7a0, str=0xbfd4f388,
    qt=0, ssub=4) at /home/bor/src/zsh/Src/subst.c:1499
#3  0x080ba738 in stringsubst (list=0xbfd4f7ac, node=0xbfd4f7a0, ssub=4,
    asssub=0) at /home/bor/src/zsh/Src/subst.c:156
#4  0x080ba089 in prefork (list=0xbfd4f7ac, flags=4)
    at /home/bor/src/zsh/Src/subst.c:91
#5  0x080bacbb in singsub (s=0xbfd4f8ac)
    at /home/bor/src/zsh/Src/subst.c:308
#6  0x0806633b in evalcond (state=0xbfd5022c, fromtest=0x0)
    at /home/bor/src/zsh/Src/cond.c:151
#7  0x0806fe61 in execcond (state=0xbfd5022c, do_exec=0)
    at /home/bor/src/zsh/Src/exec.c:3423
#8  0x08068d91 in execsimple (state=0xbfd5022c)
    at /home/bor/src/zsh/Src/exec.c:827
#9  0x08068e6c in execlist (state=0xbfd5022c, dont_change_job=1, exiting=0)
    at /home/bor/src/zsh/Src/exec.c:873
#10 0x080909e6 in execif (state=0xbfd5022c, do_exec=0)
    at /home/bor/src/zsh/Src/loop.c:505
#11 0x0806dcfe in execcmd (state=0xbfd5022c, input=0, output=0, how=18,
    last1=2) at /home/bor/src/zsh/Src/exec.c:2535
#12 0x0806a1dc in execpline2 (state=0xbfd5022c, pcode=387, how=18, input=0,
    output=0, last1=0) at /home/bor/src/zsh/Src/exec.c:1301
#13 0x0806967e in execpline (state=0xbfd5022c, slcode=38914, how=18,
    last1=0) at /home/bor/src/zsh/Src/exec.c:1087
#14 0x08068f51 in execlist (state=0xbfd5022c, dont_change_job=1, exiting=0)
    at /home/bor/src/zsh/Src/exec.c:893
#15 0x0808fc1b in execfor (state=0xbfd5022c, do_exec=0)
    at /home/bor/src/zsh/Src/loop.c:159
#16 0x0806dcfe in execcmd (state=0xbfd5022c, input=0, output=0, how=2,
    last1=2) at /home/bor/src/zsh/Src/exec.c:2535
#17 0x0806a1dc in execpline2 (state=0xbfd5022c, pcode=259, how=2, input=0,
    output=0, last1=0) at /home/bor/src/zsh/Src/exec.c:1301
#18 0x0806967e in execpline (state=0xbfd5022c, slcode=45058, how=2,
    last1=0) at /home/bor/src/zsh/Src/exec.c:1087
#19 0x08068f51 in execlist (state=0xbfd5022c, dont_change_job=1, exiting=0)
    at /home/bor/src/zsh/Src/exec.c:893
#20 0x08068c56 in execode (p=0x8102e38, dont_change_job=1, exiting=0)
    at /home/bor/src/zsh/Src/exec.c:793
#21 0x08070fb2 in runshfunc (prog=0x8102e38, wrap=0x0,
    name=0xb7c414f0 "zmodunload") at /home/bor/src/zsh/Src/exec.c:3915
#22 0xb7bd82ac in ?? ()
#23 0x08102e38 in ?? ()
#24 0x00000000 in ?? ()

* Re: mb_metacharlenconv vs. tokens
From: Peter Stephenson @ 2006-09-26 18:10 UTC
To: zsh-workers

Andrey Borzenkov wrote:
> Then this is a very basic bug, because something as simple as running a
> function from the V01 test results in:
>
> Breakpoint 2, mb_metacharlenconv (
>     s=0xb7c41951 "\215\210M\211\205\215\210f\211\231\212\210zmodload -d\211\231\216:\204\207:\207 \205m\210 \207\214\211\216",
>     wcp=0xbfd4ef88)
>
> which corresponds to the line in zmodunload:
>
>     if [[ -z ${(M)${(f)"$(zmodload -d)"}:#*:* $m( *|)} ]]
>
> > It should have been
> > untokenized at some point in the sequence leading to the call,
>
> #2  0x080bd2a4 in paramsubst (l=0xbfd4f7ac, n=0xbfd4f7a0, str=0xbfd4f388,
>     qt=0, ssub=4) at /home/bor/src/zsh/Src/subst.c:1499

The problem is probably here (or around here... there's been some
recursive jiggery pokery). We should untokenize a nested substitution
before trying to do anything with it, and only tokenize it later if the
effect of GLOB_SUBST is present. This is roughly where I saw the problem
before.

* Re: mb_metacharlenconv vs. tokens
From: Andrey Borzenkov @ 2006-09-27 16:31 UTC
To: zsh-workers

On Tuesday 26 September 2006 22:10, Peter Stephenson wrote:
> Andrey Borzenkov wrote:
> > Then this is a very basic bug, because something as simple as running
> > a function from the V01 test results in:
> >
> > Breakpoint 2, mb_metacharlenconv (
> >     s=0xb7c41951 "\215\210M\211\205\215\210f\211\231\212\210zmodload -d\211\231\216:\204\207:\207 \205m\210 \207\214\211\216",
> >     wcp=0xbfd4ef88)
> >
> > which corresponds to the line in zmodunload:
> >
> >     if [[ -z ${(M)${(f)"$(zmodload -d)"}:#*:* $m( *|)} ]]
> >
> > > It should have been
> > > untokenized at some point in the sequence leading to the call,
> >
> > #2  0x080bd2a4 in paramsubst (l=0xbfd4f7ac, n=0xbfd4f7a0,
> >     str=0xbfd4f388, qt=0, ssub=4) at /home/bor/src/zsh/Src/subst.c:1499
>
> The problem is probably here (or around here... there's been some
> recursive jiggery pokery).

This one seems pretty much top level.

> We should untokenize a nested
> substitution before trying to do anything with it, and only tokenize it
> later if the effect of GLOB_SUBST is present. This is roughly where I
> saw the problem before.

I am not sure if this is really possible (or feasible). While the
paramsubst case is trivially solved (worked around, actually) by

Index: Src/subst.c
===================================================================
RCS file: /cvsroot/zsh/zsh/Src/subst.c,v
retrieving revision 1.63
diff -u -p -r1.63 subst.c
--- Src/subst.c 23 Sep 2006 20:25:06 -0000      1.63
+++ Src/subst.c 27 Sep 2006 16:23:41 -0000
@@ -1496,12 +1496,13 @@ paramsubst(LinkList l, LinkNode n, char
         * these later on, too.
         */
        c = *s;
-       if (itype_end(s, IIDENT, 1) == s && *s != '#' && c != Pound &&
+       if (*s != '#' && c != Pound &&
            c != '-' && c != '!' && c != '$' && c != String && c != Qstring &&
            c != '?' && c != Quest &&
            c != '*' && c != Star && c != '@' && c != '{' &&
            c != Inbrace && c != '=' && c != Equals && c != Hat &&
-           c != '^' && c != '~' && c != Tilde && c != '+') {
+           c != '^' && c != '~' && c != Tilde && c != '+' &&
+           (itok(c) || itype_end(s, IIDENT, 1) == s)) {
            s[-1] = '$';
            *str = s;
            return n;

the fetchvalue() case is not, and in fetchvalue() we explicitly look for a
tokenized string.

I am afraid that mb_metastrlenconv is rather overloaded. It looks like
traversing a string character by character is a valid operation for input
as well, so we cannot exclude tokens there.

Let's put it differently: what we want is to avoid passing a bogus
character to mbrtowc(). If we *know* the context is tokenized, we could
just as well pass a flag to itype_end() and mb_metacharlenconv() so that
they check for tokens and skip them. Does that actually make sense?

* Re: mb_metacharlenconv vs. tokens
From: Peter Stephenson @ 2006-09-27 16:51 UTC
To: zsh-workers

Andrey Borzenkov wrote:
> I am afraid that mb_metastrlenconv is rather overloaded. It looks like
> traversing a string character by character is a valid operation for input
> as well, so we cannot exclude tokens there.
>
> Let's put it differently: what we want is to avoid passing a bogus
> character to mbrtowc(). If we *know* the context is tokenized, we could
> just as well pass a flag to itype_end() and mb_metacharlenconv() so that
> they check for tokens and skip them. Does that actually make sense?

Yes, that makes perfect sense. It's then a case of deciding on the
context, but that's a lot less heavyweight than untokenizing.

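The flag idea agreed on above might look roughly like the following C
sketch, in which the caller states whether the string is tokenized and token
bytes are then stepped over without ever reaching mbrtowc(). Everything
named here is an assumption made for illustration: is_token_byte,
charlen_skip_tokens and count_chars are hypothetical, and the token range
0x84 to 0xa1 is a stand-in chosen to cover the bytes seen in the gdb dump.
The real test in zsh is itok(), and real zsh strings are also metafied,
which this sketch ignores.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* ASSUMPTION for illustration only: token bytes occupy 0x84..0xa1. */
static int is_token_byte(unsigned char c)
{
    return c >= 0x84 && c <= 0xa1;
}

/*
 * Hypothetical helper: length in bytes of the first character of s.
 * When tokenized is non-zero, a token byte at the start of s counts as
 * one character and is never passed to mbrtowc().
 */
size_t charlen_skip_tokens(const char *s, int tokenized)
{
    mbstate_t st;
    size_t ret;

    assert(s != NULL);
    if (*s == '\0')
        return 0;

    /* Known tokenized context: recognise the token before decoding. */
    if (tokenized && is_token_byte((unsigned char)*s))
        return 1;

    memset(&st, 0, sizeof st);
    ret = mbrtowc(NULL, s, strlen(s), &st);
    if (ret == (size_t)-1 || ret == (size_t)-2 || ret == 0)
        return 1;   /* undecodable: count a single byte */
    return ret;
}

/* Hypothetical helper: number of characters in s, tokens included. */
size_t count_chars(const char *s, int tokenized)
{
    size_t n = 0;

    while (*s != '\0') {
        s += charlen_skip_tokens(s, tokenized);
        n++;
    }
    return n;
}
```

Testing the flag before decoding keeps bogus bytes away from mbrtowc()
without modifying the string, which is the sense in which this is lighter
than untokenizing. The remaining work, as noted above, is deciding at each
call site whether the context really is tokenized.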