* Stuff to do @ 2006-09-27 12:11 Peter Stephenson 2006-09-27 13:09 ` DervishD ` (3 more replies) 0 siblings, 4 replies; 12+ messages in thread From: Peter Stephenson @ 2006-09-27 12:11 UTC (permalink / raw) To: Zsh hackers list I'm aware of the following things that need fixing, not including problems we aren't currently able to debug directly (crashes and hangs on certain systems). They are all problems that show up particularly when dealing with non-ASCII characters, but the last two aren't actually specific to that: - Finish making padding multibyte-aware: it's not done for printf nor for padding using typeset flags. - The matcher specifications in completion don't handle multibyte characters and are currently written in such a way as to make this hard (similar to the old suffix character handling). - $' quoting is badly handled in completion. - More untokenization is needed; certainly in parameter substitution and quite possibly in random other places including completion. No doubt more will show up. This isn't going to be quick to get right. -- Peter Stephenson <pws@csr.com> Software Engineer CSR PLC, Churchill House, Cambridge Business Park, Cowley Road Cambridge, CB4 0WZ, UK Tel: +44 (0)1223 692070 To access the latest news from CSR copy this link into a web browser: http://www.csr.com/email_sig.php ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Stuff to do 2006-09-27 12:11 Stuff to do Peter Stephenson @ 2006-09-27 13:09 ` DervishD 2006-09-29 3:20 ` Bart Schaefer 2006-09-29 16:37 ` Andrey Borzenkov ` (2 subsequent siblings) 3 siblings, 1 reply; 12+ messages in thread From: DervishD @ 2006-09-27 13:09 UTC (permalink / raw) To: Peter Stephenson; +Cc: Zsh hackers list Hi Peter :) * Peter Stephenson <pws@csr.com> dixit: > No doubt more will show up. This isn't going to be quick to get right. Peter, you're doing a superb job maintaining, fixing and improving the mess that the zsh code is. The same goes to the other contributors: you're geniuses (evil geniuses, of course XDD). I've read some of the code in zsh and sincerely, I would start it from scratch. Maintaining such messy code is one of the worst things I have in my worst nightmares. So, if this is not going to be quick, it doesn't matter at all. Even if you stop maintaining zsh, it doesn't matter at all. We should be more than thankful to you and the other contributors for what you've done until now. It's the kind of work that it impossible to pay, and the reason I make free software. Thanks a lot for your hard work: zsh is just fantastic and although far from perfect, the people behind it are just fantastic. Go on :)) Raúl Núñez de Arenas Coronado -- Linux Registered User 88736 | http://www.dervishd.net It's my PC and I'll cry if I want to... RAmen! ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Stuff to do 2006-09-27 13:09 ` DervishD @ 2006-09-29 3:20 ` Bart Schaefer 2006-09-29 8:05 ` DervishD 0 siblings, 1 reply; 12+ messages in thread From: Bart Schaefer @ 2006-09-29 3:20 UTC (permalink / raw) To: Zsh hackers list On Sep 27, 3:09pm, DervishD wrote: } } I've read some of the code in zsh and sincerely, I would start } it from scratch. Having worked in the software industry for going on 20 years now, I can say with quite some confidence that once a piece of code reaches a certain level of complexity, this is almost always the wrong approach. Unless of course your goal is to abandon everything about the original program and re-create only a subset -- because the chances of ever getting back to where you started are very small. Zsh passed that level of complexity a very long time ago. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Stuff to do 2006-09-29 3:20 ` Bart Schaefer @ 2006-09-29 8:05 ` DervishD 0 siblings, 0 replies; 12+ messages in thread From: DervishD @ 2006-09-29 8:05 UTC (permalink / raw) To: Bart Schaefer; +Cc: Zsh hackers list Hi Bart :) * Bart Schaefer <schaefer@brasslantern.com> dixit: > On Sep 27, 3:09pm, DervishD wrote: > } I've read some of the code in zsh and sincerely, I would > } start it from scratch. > > Having worked in the software industry for going on 20 years now, I > can say with quite some confidence that once a piece of code > reaches a certain level of complexity, this is almost always the > wrong approach. I think that this depends on how messy is the code in the project. I mean, if the code is so messy (and that doesn't mean complex) that every time you touch anything you wreck havoc, then doing something from start is a good idea (IMHO). I suppose that it's a question of evaluating maintainability versus the effort of starting from scratch. If a project is barely maintainable, sometimes it's better to start from scratch, but again that depends on how complex the design is, of course. > Zsh passed that level of complexity a very long time ago. I know, and of course I'm not asking you to start the code from scratch, I'm just saying that *I* would do it ;)) I have a project that too passed that level of complexity a time ago, and in fact I left the project in the hands of the other co-author because I wasn't motivated enough to continue, and probably rewriting it will be a complete stupidity, but if I had the time, I swear I'll rewrite it from scratch. I simply hate maintaining code that tries to resist every try of being maintained and which contains pieces that nobody understands XD With Zsh, I think that I would use a subset of it, as you suggest, if I would start a "new-Zsh". Rewriting all from the scratch is probably impossible now, because if you want to create a bug-compatible version of "new-Zsh" you will probably have to use the same code, so... Raúl Núñez de Arenas Coronado -- Linux Registered User 88736 | http://www.dervishd.net It's my PC and I'll cry if I want to... RAmen! ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Stuff to do 2006-09-27 12:11 Stuff to do Peter Stephenson 2006-09-27 13:09 ` DervishD @ 2006-09-29 16:37 ` Andrey Borzenkov 2006-09-29 17:08 ` Peter Stephenson 2006-09-29 18:08 ` Andrey Borzenkov 2006-10-08 15:38 ` quest for bld_line (was: Re: Stuff to do) Andrey Borzenkov 3 siblings, 1 reply; 12+ messages in thread From: Andrey Borzenkov @ 2006-09-29 16:37 UTC (permalink / raw) To: zsh-workers [-- Attachment #1.1: Type: text/plain, Size: 3008 bytes --] On Wednesday 27 September 2006 16:11, Peter Stephenson wrote: > - The matcher specifications in completion don't handle multibyte > characters and are currently written in such a way as to make this > hard (similar to the old suffix character handling). I have been looking at it recently. So far the following issues came up. 1. matcher code assumes character == byte and is using 256 bytes array to build character equivalence classes. What is worse, it is passing this array around between different functions to suppply results of previous matching. I have here patch (attached) that eliminates external dependency on this array so matcher internals can be more easily changed. This seems to make code a bit more understandable irrespectively :) OK to commit? 2. Usage of magic array for character classes ([abcd]) can be naturally superceded by using either generic pattern matching or direct comparison. Pattern matching provides for using something like [[:lower:]] and possibly using matchers etc but potential side effects of extended globbing need review. I do not know what is faster. Is it OK? 3. Equivalence classes ({abcd}={xyzw}) do not scale beyond single byte characters. But if we check usage I believe, it has never been used for anything beyond case-insensitive matching. For this particular usage I suggest using new matcher type: m:LPAT>upper m:LPAT>lower with obvious semantic - character from line is converted to lower or upper and compared with character from potential match. So m:{a-z}={A-Z} becomes m:?>upper etc. We still can implement {...} for character _set_ but not for character range. So far I do not consider it major problem. 4. The hardest part. Right anchor. For this matcher must match _backward_. I am not aware of any way to walk backward as long as we assume arbitrary encoding. Options apparently are a) careful modification of code to compute line and patter length and try to match from the (llen - plen) point. After all we know that we do not need to match more than that. This may be doable, I did not yet try. b) convert this code to use wide characters. Not sure if this is a viable option. c) Use UTF-8 :) It is no joke - I expect nowadays 99% of all systems using multibyte encoding use UTF-8. So we may concentrate on most commonly used case and think about other encoding later if anyone really needs it. As soon as we assume input be in UTF-8 we immediately get enormous advantages - it can be easily traversed both backwards and forwards (meta adds some complexity but not much) - length is known in advance, we can save mbrtowc() calls - if we know that system is using UCS-4 for wide characters (and I guess we do not support others at all) converting to/from wide character can be implemented without calling mbrtowc() & Co. at all. - (remotely) this allows us to get rid of META which saves quite a bit of memory allocation overhead. Comments? [-- Attachment #1.2: pattern_match.diff --] [-- Type: text/x-diff, Size: 8852 bytes --] Index: Src/Zle/compmatch.c =================================================================== RCS file: /cvsroot/zsh/zsh/Src/Zle/compmatch.c,v retrieving revision 1.48 diff -u -p -r1.48 compmatch.c --- Src/Zle/compmatch.c 12 Jan 2006 00:51:51 -0000 1.48 +++ Src/Zle/compmatch.c 29 Sep 2006 16:00:31 -0000 @@ -461,7 +461,6 @@ match_str(char *l, char *w, Brinfo *bpp, { int ll = strlen(l), lw = strlen(w), oll = ll, olw = lw, exact = 0, wexact = 0; int il = 0, iw = 0, t, ind, add, he = 0, bpc, obc = bc, bslash; - VARARR(unsigned char, ea, (ll > lw ? ll : lw) + 1); char *ow; Cmlist ms; Cmatcher mp, lm = NULL; @@ -795,9 +794,7 @@ match_str(char *l, char *w, Brinfo *bpp, (il || iw))); } /* Now try to match the line and word patterns. */ - if (!t || - !pattern_match(mp->line, tl, NULL, ea) || - !pattern_match(mp->word, tw, ea, NULL)) + if (!t || !pattern_match(mp->line, tl, mp->word, tw)) continue; /* Probably add the matched strings. */ @@ -1091,44 +1088,51 @@ comp_match(char *pfx, char *sfx, char *w } /* Check if the given pattern matches the given string. * - * `in' and `out' are used for {...} classes. In `out' we store the * - * character number that was matched. In the word pattern this is * - * given in `in' so that we can easily test if we found the * - * corresponding character. */ + * p and s are either anchor or line pattern and string; + * wp and ws are word (candidate) pattern and string + * + * If only one pattern is given, we just check if characters match + * If both line and word are given, we check that characters match + * for {...} classes by comparing relative numbers in sequence. + * + * Patterns and strings are always passed in pairs, so it is enough + * to check for non-NULL wp. p should always be present. + */ /**/ mod_export int -pattern_match(Cpattern p, char *s, unsigned char *in, unsigned char *out) +pattern_match(Cpattern p, char *s, Cpattern wp, char *ws) { unsigned char c; + unsigned char wc; - while (p) { - c = *((unsigned char *) s); + while (p && wp && *s && *ws) { + c = p->tab[*((unsigned char *) s)]; + wc = wp->tab[*((unsigned char *) ws)]; - if (out) - *out = 0; - - if (p->equiv) { - if (in) { - c = p->tab[c]; - if ((*in && *in != c) || (!*in && !c)) - return 0; - } else if (out) { - if (!(*out = p->tab[c])) - return 0; - } else if (!p->tab[c]) - return 0; - - if (in && *in) - in++; - if (out) - out++; - } else if (!p->tab[c]) + if (!c || !wc || c != wc) return 0; s++; + ws++; p = p->next; + wp = wp->next; + } + + while (p && *s) { + if (!p->tab[*((unsigned char *) s)]) + return 0; + p = p->next; + s++; } + + while (wp && *ws) { + if (!wp->tab[*((unsigned char *) ws)]) + return 0; + wp = wp->next; + ws++; + } + return 1; } @@ -1214,38 +1218,48 @@ bld_parts(char *str, int len, int plen, * buffer line. Initially line is the same as lp, but during recursive * calls lp is incremented for storing successive characters. Whenever * a full possible string is build, we test if this line matches the - * string given by wlen and word. The in argument contains the characters - * to use for the correspondence classes, it was filled by a call to - * pattern_match() in the calling function. + * string given by wlen and word. + * + * wpat contains pattern that matched previously + * lpat contains the pattern for line we build + * mword is a string that matched wpat before + * word is string that we try to match now + * * The return value is the length of the string matched in the word, it - * is zero if we couldn't build a line that matches the word. */ + * is zero if we couldn't build a line that matches the word. + */ + /**/ static int -bld_line(Cpattern pat, char *line, char *lp, - char *word, int wlen, unsigned char *in, int sfx) +bld_line(Cpattern wpat, Cpattern lpat, char *line, char *lp, + char *mword, char *word, int wlen, int sfx) { - if (pat) { + if (lpat) { /* Still working on the pattern. */ int i, l; unsigned char c = 0; /* Get the number of the character for a correspondence class - * if it has a correxponding class. */ - if (pat->equiv) - if ((c = *in)) - in++; + * if it has a corresponding class. */ + if (lpat->equiv) + if (wpat && *mword) { + c = wpat->tab[STOUC(*mword)]; + wpat = wpat->next; + mword++; + } + /* Walk through the table in the pattern and try the characters * that may appear in the current position. */ for (i = 0; i < 256; i++) - if ((pat->equiv && c) ? (c == pat->tab[i]) : pat->tab[i]) { + if ((lpat->equiv && c) ? (c == lpat->tab[i]) : lpat->tab[i]) { *lp = i; /* We stored the character, now call ourselves to build * the rest. */ - if ((l = bld_line(pat->next, line, lp + 1, word, wlen, - in, sfx))) + if ((l = bld_line(wpat, lpat->next, line, lp + 1, + mword, word, wlen, sfx))) return l; } } else { @@ -1255,7 +1269,6 @@ bld_line(Cpattern pat, char *line, char Cmlist ms; Cmatcher mp; int l = lp - line, t, rl = 0, ind, add; - VARARR(unsigned char, ea, l + 1); /* Quick test if the strings are exactly the same. */ if (l == wlen && !strncmp(line, word, l)) @@ -1279,9 +1292,7 @@ bld_line(Cpattern pat, char *line, char mp = ms->matcher; if (mp && !mp->flags && mp->wlen <= wlen && mp->llen <= l && pattern_match(mp->line, (sfx ? line - mp->llen : line), - NULL, ea) && - pattern_match(mp->word, (sfx ? word - mp->wlen : word), - ea, NULL)) { + mp->word, (sfx ? word - mp->wlen : word))) { /* Both the line and the word pattern matched, * now skip over the matched portions. */ if (sfx) { @@ -1316,7 +1327,6 @@ join_strs(int la, char *sa, int lb, char static char *rs = NULL; static int rl = 0; - VARARR(unsigned char, ea, (la > lb ? la : lb) + 1); Cmlist ms; Cmatcher mp; int t, bl, rr = rl; @@ -1331,8 +1341,7 @@ join_strs(int la, char *sa, int lb, char mp->wlen <= la && mp->wlen <= lb) { /* The pattern has no anchors and the word * pattern fits, try it. */ - if ((t = pattern_match(mp->word, sa, NULL, ea)) || - pattern_match(mp->word, sb, NULL, ea)) { + if ((t = pattern_match(mp->word, sa, mp->word, sb))) { /* It matched one of the strings, t says which one. */ VARARR(char, line, mp->llen + 1); char **ap, **bp; @@ -1347,8 +1356,8 @@ join_strs(int la, char *sa, int lb, char } /* Now try to build a string that matches the other * string. */ - if ((bl = bld_line(mp->line, line, line, - *bp, *blp, ea, 0))) { + if ((bl = bld_line(mp->word, mp->line, line, line, + *ap, *bp, *blp, 0))) { /* Found one, put it into the return string. */ line[mp->llen] = '\0'; if (rr <= mp->llen) { @@ -1503,7 +1512,6 @@ join_sub(Cmdata md, char *str, int len, int ol = len, nl = md->len; Cmlist ms; Cmatcher mp; - VARARR(unsigned char, ea, (ol > nl ? ol : nl) + 1); int t; if (sfx) { @@ -1519,9 +1527,7 @@ join_sub(Cmdata md, char *str, int len, * new one. */ if (mp->llen <= ol && mp->wlen <= nl && pattern_match(mp->line, ow - (sfx ? mp->llen : 0), - NULL, ea) && - pattern_match(mp->word, nw - (sfx ? mp->wlen : 0), - ea, NULL)) { + mp->word, nw - (sfx ? mp->wlen : 0))) { /* It did, update the contents of the cmdata struct * and return a cline for the matched part. */ if (sfx) @@ -1539,17 +1545,22 @@ join_sub(Cmdata md, char *str, int len, * pattern matches one of the strings. */ if (join && mp->wlen <= ol && mp->wlen <= nl && ((t = pattern_match(mp->word, ow - (sfx ? mp->wlen : 0), - NULL, ea)) || + NULL, NULL)) || pattern_match(mp->word, nw - (sfx ? mp->wlen : 0), - NULL, ea))) { + NULL, NULL))) { VARARR(char, line, mp->llen + 1); int bl; + char *mw; /* Then build all the possible lines and see * if one of them matches the other string. */ - if ((bl = bld_line(mp->line, line, line, - (t ? nw : ow), (t ? nl : ol), - ea, sfx))) { + if (t) + mw = ow - (sfx ? mp->wlen : 0); + else + mw = nw - (sfx ? mp->wlen : 0); + + if ((bl = bld_line(mp->word, mp->line, line, line, + mw, (t ? nw : ow), (t ? nl : ol), sfx))) { /* Yep, one of the lines matched the other * string. */ line[mp->llen] = '\0'; [-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Stuff to do 2006-09-29 16:37 ` Andrey Borzenkov @ 2006-09-29 17:08 ` Peter Stephenson 2006-09-29 18:08 ` Andrey Borzenkov 0 siblings, 1 reply; 12+ messages in thread From: Peter Stephenson @ 2006-09-29 17:08 UTC (permalink / raw) To: zsh-workers Andrey Borzenkov <arvidjaar@newmail.ru> wrote: > 1. matcher code assumes character == byte and is using 256 bytes array to > build character equivalence classes. What is worse, it is passing this array > around between different functions to suppply results of previous matching. I > have here patch (attached) that eliminates external dependency on this array > so matcher internals can be more easily changed. This seems to make code a > bit more understandable irrespectively :) OK to commit? Yes, the more the calling conventions are sanitized like this the better I like it. The references to external data are one of my worst nightmares. > 2. Usage of magic array for character classes ([abcd]) can be naturally > superceded by using either generic pattern matching or direct comparison. > Pattern matching provides for using something like [[:lower:]] and possibly > using matchers etc but potential side effects of extended globbing need > review. I do not know what is faster. Is it OK? I'd be quite keen on being able to do this by using globbing. I think the current uses of matcher specifications are limited enough (sometimes by necessity, as we're seeing) that an extension wouldn't be a problem for compatibility; however, I don't know how to mix this with the equivalence class stuff. It would be quite nice to keep it in one place in pattern.c, but I doubt if that's going to work with all the additions we need. > 3. Equivalence classes ({abcd}={xyzw}) do not scale beyond single byte > characters. But if we check usage I believe, it has never been used for > anything beyond case-insensitive matching. For this particular usage I > suggest using new matcher type: > > m:LPAT>upper > m:LPAT>lower > > with obvious semantic - character from line is converted to lower or upper and > compared with character from potential match. So m:{a-z}={A-Z} becomes > m:?>upper etc. > > We still can implement {...} for character _set_ but not for character range. > So far I do not consider it major problem. I think we'll need to keep it working for ASCII for compatibility, but not extending it to other characters is, as you say, not a big problem. However, maybe it's not a problem at all; see below. > 4. The hardest part. Right anchor. For this matcher must match _backward_. I > am not aware of any way to walk backward as long as we assume arbitrary > encoding. Options apparently are >... > b) convert this code to use wide characters. Not sure if this is a viable > option. This is the option I was thinking about, and it removes the range problem since it extends the ASCII logic in a natural way (it may be system dependent, but that's the absolute least of our worries). I don't think it's a problem using wide characters locally for the comparisons. Indeed, the pattern match code does all its character class stuff with wide characters (or kludged wide characters which are just the unsigned char values if a multibyte sequence doesn't convert). It doesn't really make sense to allow for unconvertible characters in matcher comparisons---it's great to be able to insert them on the command line in some fashion, but the matcher specs only make sense for characters that are convertible. The worst problem is that we lose the ability to do matching control where (say) much of the string is ASCII, and our match rules only use ASCII, but there are also characters that don't work in the current locale. I don't think this is a big issue and there are possible ways round: - partial conversion - convert them at this stage to $'\...' sequences instead of later - use marked wide characters where we record a byte that can't be converted --- any of which could be bolted on later. So I don't think that's a showstopper. I was wondering how much of the code we needed to convert to use wide characters, and vaguely came to the conclusion the only reasonable sane way was to do it fairly locally within the comparison function(s), since otherwise the interface to the rest of the completion system gets very hairy. However, I haven't actually looked at the code again since coming to that conclusion. However, if there's an easy way of doing it by another method, fine. I suspect there isn't. -- Peter Stephenson <pws@csr.com> Software Engineer CSR PLC, Churchill House, Cambridge Business Park, Cowley Road Cambridge, CB4 0WZ, UK Tel: +44 (0)1223 692070 To access the latest news from CSR copy this link into a web browser: http://www.csr.com/email_sig.php ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Stuff to do 2006-09-29 17:08 ` Peter Stephenson @ 2006-09-29 18:08 ` Andrey Borzenkov 0 siblings, 0 replies; 12+ messages in thread From: Andrey Borzenkov @ 2006-09-29 18:08 UTC (permalink / raw) To: zsh-workers -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Friday 29 September 2006 21:08, Peter Stephenson wrote: > > Yes, the more the calling conventions are sanitized like this the better > I like it. The references to external data are one of my worst > nightmares. > Ok I committed it. > > 2. Usage of magic array for character classes ([abcd]) can be naturally > > superceded by using either generic pattern matching or direct comparison. > > Pattern matching provides for using something like [[:lower:]] and > > possibly using matchers etc but potential side effects of extended > > globbing need review. I do not know what is faster. Is it OK? > > I'd be quite keen on being able to do this by using globbing. I think the > current uses of matcher specifications are limited enough (sometimes by > necessity, as we're seeing) that an extension wouldn't be a problem for > compatibility; however, I don't know how to mix this with the equivalence > class stuff. It would be quite nice to keep it in one place in pattern.c, > but I doubt if that's going to work with all the additions we need. > > > 3. Equivalence classes ({abcd}={xyzw}) do not scale beyond single byte > > characters. But if we check usage I believe, it has never been used for > > anything beyond case-insensitive matching. For this particular usage I > > suggest using new matcher type: > > > > m:LPAT>upper > > m:LPAT>lower > > > > with obvious semantic - character from line is converted to lower or > > upper and compared with character from potential match. So m:{a-z}={A-Z} > > becomes m:?>upper etc. > > > > We still can implement {...} for character _set_ but not for character > > range. So far I do not consider it major problem. > > I think we'll need to keep it working for ASCII for compatibility, yes, that for sure; the idea was actually check if character is < 256 and reject pattern otherwise. > but not > extending it to other characters is, as you say, not a big problem. > However, maybe it's not a problem at all; see below. > > > 4. The hardest part. Right anchor. For this matcher must match > > _backward_. I am not aware of any way to walk backward as long as we > > assume arbitrary encoding. Options apparently are > >... > > b) convert this code to use wide characters. Not sure if this is a viable > > option. > > This is the option I was thinking about, and it removes the range problem > since it extends the ASCII logic in a natural way (it may be system > dependent, but that's the absolute least of our worries). > It does not, unfortunately. For ranges that is. What you effectively suggest is to assume that for {a-z}={A-Z} p matches q if p-a == q-A. This is not actually true even for 8 bit EBCDIC (I won't eat my hat on it though :); for arbitrary encoding it is simply meaningless. This fails even for basic European plane (because ß has no upper counterpart); and as soon as you move to next plane (1xx) it stops working completely. We really need something more portable; the exact syntax is open to discussion of course but so far I like m:?>upper :) It is so zsh-ish obscure ... > I don't think it's a problem using wide characters locally for the > comparisons. Indeed, the pattern match code does all its character class > stuff with wide characters (or kludged wide characters which are just the > unsigned char values if a multibyte sequence doesn't convert). It doesn't > really make sense to allow for unconvertible characters in matcher > comparisons---it's great to be able to insert them on the command line in > some fashion, but the matcher specs only make sense for characters that are > convertible. > > The worst problem is that we lose the ability to do matching control where > (say) much of the string is ASCII, and our match rules only use ASCII, but > there are also characters that don't work in the current locale. I don't > think this is a big issue and there are possible ways round: > - partial conversion > - convert them at this stage to $'\...' sequences instead of later as long as we are able to distinguish between line with real $'...' it is fine. Are we? > - use marked wide characters where we record a byte that can't be converted That is probably better. This has to be implemented at some point anyway to allow input of arbitrary data (vared) > --- any of which could be bolted on later. So I don't think that's a > showstopper. > > I was wondering how much of the code we needed to convert to use wide > characters, and vaguely came to the conclusion the only reasonable sane way > was to do it fairly locally within the comparison function(s), Sure at some point pattern_match will have to convert to wide character to test for properties. It does not solve the l[-1] problem; and code is full of them. > since > otherwise the interface to the rest of the completion system gets very > hairy. Exactly. I tried to convert compamtch.c to wide characters. This itself is more or less mechanical task; but it adds insane amount of converting all over the other places. This is already slow as is :( > However, I haven't actually looked at the code again since > coming to that conclusion. > > However, if there's an easy way of doing it by another method, fine. I > suspect there isn't. it depends on your definition of easy :) -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.5 (GNU/Linux) iD8DBQFFHWEGR6LMutpd94wRAqqBAJ0b+XY0Fr24mg0LZpY/CxyMnKZgUwCgzRZE qirOdaNVPC4pk0wivdvEAvY= =ocDN -----END PGP SIGNATURE----- ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Stuff to do 2006-09-27 12:11 Stuff to do Peter Stephenson 2006-09-27 13:09 ` DervishD 2006-09-29 16:37 ` Andrey Borzenkov @ 2006-09-29 18:08 ` Andrey Borzenkov 2006-10-08 15:38 ` quest for bld_line (was: Re: Stuff to do) Andrey Borzenkov 3 siblings, 0 replies; 12+ messages in thread From: Andrey Borzenkov @ 2006-09-29 18:08 UTC (permalink / raw) To: zsh-workers -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Wednesday 27 September 2006 16:11, Peter Stephenson wrote: > I'm aware of the following things that need fixing, not including > problems we aren't currently able to debug directly (crashes and hangs > on certain systems). They are all problems that show up particularly > when dealing with non-ASCII characters, but the last two aren't actually > specific to that: > > - Finish making padding multibyte-aware: it's not done for printf nor > for padding using typeset flags. > - The matcher specifications in completion don't handle multibyte > characters and are currently written in such a way as to make this > hard (similar to the old suffix character handling). > - $' quoting is badly handled in completion. > - More untokenization is needed; certainly in parameter substitution and > quite possibly in random other places including completion. > > No doubt more will show up. - - vared does not work for binary data -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.5 (GNU/Linux) iD8DBQFFHWEtR6LMutpd94wRAiUCAJ97uTgt6pc5znNmHDu/IMm93uffaQCcDQ6N aP1EqG7i4/CRLf7xS/0lIpA= =wS6Q -----END PGP SIGNATURE----- ^ permalink raw reply [flat|nested] 12+ messages in thread
* quest for bld_line (was: Re: Stuff to do) 2006-09-27 12:11 Stuff to do Peter Stephenson ` (2 preceding siblings ...) 2006-09-29 18:08 ` Andrey Borzenkov @ 2006-10-08 15:38 ` Andrey Borzenkov 2006-10-09 12:00 ` Peter Stephenson 3 siblings, 1 reply; 12+ messages in thread From: Andrey Borzenkov @ 2006-10-08 15:38 UTC (permalink / raw) To: zsh-workers -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Wednesday 27 September 2006 16:11, Peter Stephenson wrote: > - The matcher specifications in completion don't handle multibyte > characters and are currently written in such a way as to make this > hard (similar to the old suffix character handling). OK here is next patch that does not fix the above but tries to remove one more obstacle for it. bld_line tries to find (and actually build) a line that can match two given words. It does so by building *all* possible lines that match one word and trying to match every built line against second word. Now the word "all" makes possibility to do the same for arbitrary character set rather abstract. I must admit that I still do not understand why Sven needed this function nor how line that it builds is used later. What I am confident in, the Clines that are built using this function are removed later in compresult and never appear anywhere on command line. I tried to invent some way to mimic it as close to original as I could. It is incomplete; nor am I sure if there any way to do it differently. The point of patch is to replace exhaustive enumeration of all possible combinations by comparison of patterns. I.e. it checks if two patterns may have something in common - this can be generalized later using different pattern representations. I would be happy if we could just toss away this function. Comments? Index: Src/Zle/compmatch.c =================================================================== RCS file: /cvsroot/zsh/zsh/Src/Zle/compmatch.c,v retrieving revision 1.50 diff -u -p -r1.50 compmatch.c - --- Src/Zle/compmatch.c 30 Sep 2006 06:53:15 -0000 1.50 +++ Src/Zle/compmatch.c 8 Oct 2006 15:27:11 -0000 @@ -1214,105 +1214,164 @@ bld_parts(char *str, int len, int plen, return ret; } - -/* This builds all the possible line patterns for the pattern pat in the - - * buffer line. Initially line is the same as lp, but during recursive - - * calls lp is incremented for storing successive characters. Whenever - - * a full possible string is build, we test if this line matches the - - * string given by wlen and word. +/* + * Compare two different line patterns if they can have some common character + * Insert one of common characters in line we are building (it does not matter + * which one) + * mlp - line pattern which has matched before + * mwp - word pattern which has matched before + * nlp - new line pattern that we currently test against mlp + * nwp - new word pattern that we currently test against mwp + * line - line we build; we insert characters there + */ + +/**/ +static int +pattern_compare(Cpattern mlp, Cpattern mwp, Cpattern nlp, Cpattern nwp, + char *line) +{ + while (nlp) { + int i; + + /* + * test to see if mlp and nlp have something in commons + * nlp cannot be less than mlp (we check pattern length before) + * but word pattern may of course be shorter than line ... + */ + for (i = 0; i < 256; i++) + if (mlp->tab[i] && nlp->tab[i]) { + /* for equiv. class they must also match word pattern */ + if (mlp->equiv) { + if (!mwp || !nwp || (mlp->tab[i] == mwp->tab[i] && + nlp->tab[i] == nwp->tab[i])) + break; + } else + break; + } + if (i < 256) { + /* OK we found character that matches both matchers */ + *line++ = (char)i; + } else { + /* No matching character */ + return 0; + } + /* FIXME can this be out of bounds? */ + mlp = mlp->next; + nlp = nlp->next; + if (mwp) mwp = mwp->next; + if (nwp) nwp = nwp->next; + } + + return 1; +} + +/* This tries to find out, if there is common line that may match two + * words (possible matches or parts thereof). When this function is called, + * it is ensured that `mword' has matched word pattern in `matcher'; + * we try to find a string that both matches line pattern in `matcher' + * and another word `word' * - - * wpat contains pattern that matched previously - - * lpat contains the pattern for line we build - - * mword is a string that matched wpat before - - * word is string that we try to match now + * matcher - matcher that `mword' has been matched against + * line - buffer for string we build + * mword - word that has matched word pattern in `matcher' before + * word - is string that we try to match now + * wlen - length of `word' + * sfx - if we should match bacwards * - - * The return value is the length of the string matched in the word, it + * The return value is the length of the string matched in the `word', it * is zero if we couldn't build a line that matches the word. + * + * FIXME implementation is incomplete. In particular, it won't catch + * the case when part of line would have been equal to `word' and part + * requires matchers. I cannot find a way to do it without exaustive + * building of all possible line's that cannot be done as long as patterns + * may contain arbitrary multibyte characters */ - - /**/ static int - -bld_line(Cpattern wpat, Cpattern lpat, char *line, char *lp, +bld_line(Cmatcher matcher, char *line, char *mword, char *word, int wlen, int sfx) { - - if (lpat) { - - /* Still working on the pattern. */ - - - - int i, l; - - unsigned char c = 0; - - - - /* Get the number of the character for a correspondence class - - * if it has a corresponding class. */ - - if (lpat->equiv) - - if (wpat && *mword) { - - c = wpat->tab[STOUC(*mword)]; - - wpat = wpat->next; - - mword++; - - } + VARARR(Cpattern, mlpa, matcher->llen); + VARARR(Cpattern, mwpa, matcher->wlen); + Cmlist ms; + Cmatcher mp; + Cpattern pat; + char *lp; + int l = matcher->llen, t, rl = 0, ind, add, il, iw, i; + + /* Quick test if word may be direct input line */ + if (l == wlen && + pattern_match(matcher->line, word, + matcher->word, mword)) { + strncpy(line, word, wlen); + line[l] = '\0'; + return l; + } + /* Setup array instead of list; this is required for suffix match */ + for (i = 0, pat = matcher->line; pat; i++, pat = pat->next) + mlpa[i] = pat; + for (i = 0, pat = matcher->word; pat; i++, pat = pat->next) + mwpa[i] = pat; - - /* Walk through the table in the pattern and try the characters - - * that may appear in the current position. */ - - for (i = 0; i < 256; i++) - - if ((lpat->equiv && c) ? (c == lpat->tab[i]) : lpat->tab[i]) { - - *lp = i; - - /* We stored the character, now call ourselves to build - - * the rest. */ - - if ((l = bld_line(wpat, lpat->next, line, lp + 1, - - mword, word, wlen, sfx))) - - return l; - - } + if (sfx) { + ind = -1; add = -1; + il = matcher->llen; + iw = matcher->wlen; + lp = line + il; word += wlen; } else { - - /* We reached the end, i.e. the line string is fully build, now - - * see if it matches the given word. */ - - - - Cmlist ms; - - Cmatcher mp; - - int l = lp - line, t, rl = 0, ind, add; - - - - /* Quick test if the strings are exactly the same. */ - - if (l == wlen && !strncmp(line, word, l)) - - return l; + ind = 0; add = 1; + il = iw = 0; + lp = line; + } - - if (sfx) { - - line = lp; word += wlen; - - ind = -1; add = -1; - - } else { - - ind = 0; add = 1; - - } - - /* We loop through the whole line string built. */ - - while (l && wlen) { - - if (word[ind] == line[ind]) { - - /* The same character in both strings, skip over. */ - - line += add; word += add; - - l--; wlen--; rl++; - - } else { - - t = 0; - - for (ms = bmatchers; ms && !t; ms = ms->next) { - - mp = ms->matcher; - - if (mp && !mp->flags && mp->wlen <= wlen && mp->llen <= l && - - pattern_match(mp->line, (sfx ? line - mp->llen : line), - - mp->word, (sfx ? word - mp->wlen : word))) { - - /* Both the line and the word pattern matched, - - * now skip over the matched portions. */ - - if (sfx) { - - line -= mp->llen; word -= mp->wlen; - - } else { - - line += mp->llen; word += mp->wlen; - - } - - l -= mp->llen; wlen -= mp->wlen; rl += mp->wlen; - - t = 1; + /* Loop through both words */ + while (l && wlen) { +#if 0 + /* FIXME this code is likely wrong and so is disabled for now */ + if (word[ind] == mword[ind]) { + /* The same character in both strings, add it to line and + * skip over. */ + lp[ind] = word[ind]; + lp += add; word += add; mword += add; + l--; wlen--; rl++; + } else +#endif + { + t = 0; + for (ms = bmatchers; ms && !t; ms = ms->next) { + mp = ms->matcher; + if (mp && !mp->flags && mp->wlen <= wlen && mp->llen <= l && + pattern_match(mp->word, (sfx ? word - mp->wlen : word), + NULL, NULL) && + pattern_compare(mlpa[sfx ? il - mp->llen : il], + mwpa[sfx ? iw - mp->wlen : iw], + mp->line, mp->word, + (sfx ? lp - mp->llen : lp))) { + /* Both the line and the word pattern matched, + * now skip over the matched portions. */ + if (sfx) { + lp -= mp->llen; word -= mp->wlen; + il -= mp->llen; iw -= mp->wlen; + } else { + lp += mp->llen; word += mp->wlen; + il += mp->llen; iw += mp->wlen; } + l -= mp->llen; wlen -= mp->wlen; rl += mp->wlen; + t = 1; } - - if (!t) - - /* Didn't match, give up. */ - - return 0; } + if (!t) + /* Didn't match, give up. */ + return 0; } - - if (!l) - - /* Unmatched portion in the line built, return matched length. */ - - return rl; } + if (!l) + /* Unmatched portion in the line built, return matched length. */ + return rl; + return 0; } @@ -1357,7 +1416,7 @@ join_strs(int la, char *sa, int lb, char } /* Now try to build a string that matches the other * string. */ - - if ((bl = bld_line(mp->word, mp->line, line, line, + if ((bl = bld_line(mp, line, *ap, *bp, *blp, 0))) { /* Found one, put it into the return string. */ line[mp->llen] = '\0'; @@ -1560,7 +1619,7 @@ join_sub(Cmdata md, char *str, int len, else mw = nw - (sfx ? mp->wlen : 0); - - if ((bl = bld_line(mp->word, mp->line, line, line, + if ((bl = bld_line(mp, line, mw, (t ? nw : ow), (t ? nl : ol), sfx))) { /* Yep, one of the lines matched the other * string. */ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.5 (GNU/Linux) iD8DBQFFKRt9R6LMutpd94wRAqnyAJ0TDZPQf5XZGTiYyHgi7Kn7KRTxRACgqq2S eVFyUpbIOQljAVVl3VV1GnU= =jO/g -----END PGP SIGNATURE----- ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: quest for bld_line (was: Re: Stuff to do) 2006-10-08 15:38 ` quest for bld_line (was: Re: Stuff to do) Andrey Borzenkov @ 2006-10-09 12:00 ` Peter Stephenson 2006-10-09 16:28 ` Andrey Borzenkov 0 siblings, 1 reply; 12+ messages in thread From: Peter Stephenson @ 2006-10-09 12:00 UTC (permalink / raw) To: Zsh hackers list Andrey Borzenkov <arvidjaar@newmail.ru> wrote: > The point of patch is to replace exhaustive enumeration of all possible > combinations by comparison of patterns. I.e. it checks if two patterns may > have something in common - this can be generalized later using different > pattern representations. > > I would be happy if we could just toss away this function. > > Comments? I don't know enough about this stage of completion to comment in any detail, but it certainly looks like we need to do something about it... I'll try this locally. (There's a word wrapped early on which is preventing it applying but the fix is obvious.) -- Peter Stephenson <pws@csr.com> Software Engineer CSR PLC, Churchill House, Cambridge Business Park, Cowley Road Cambridge, CB4 0WZ, UK Tel: +44 (0)1223 692070 To access the latest news from CSR copy this link into a web browser: http://www.csr.com/email_sig.php ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: quest for bld_line (was: Re: Stuff to do) 2006-10-09 12:00 ` Peter Stephenson @ 2006-10-09 16:28 ` Andrey Borzenkov 2006-10-11 17:54 ` Andrey Borzenkov 0 siblings, 1 reply; 12+ messages in thread From: Andrey Borzenkov @ 2006-10-09 16:28 UTC (permalink / raw) To: zsh-workers -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Monday 09 October 2006 16:00, Peter Stephenson wrote: > Andrey Borzenkov <arvidjaar@newmail.ru> wrote: > > The point of patch is to replace exhaustive enumeration of all possible > > combinations by comparison of patterns. I.e. it checks if two patterns > > may have something in common - this can be generalized later using > > different pattern representations. > > > > I would be happy if we could just toss away this function. > > > > Comments? > > I don't know enough about this stage of completion to comment in any > detail, but it certainly looks like we need to do something about > it... to avoid duplication of efforts - I know now how to do it properly (that is, to preserve full functionality without relying on current fixed 256 byte tables), I just have to find some time to write it down. > I'll try this locally. (There's a word wrapped early on which is > preventing it applying but the fix is obvious.) sorry for this; every time I hope kmail has more common sense to not do it -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.5 (GNU/Linux) iD8DBQFFKnjCR6LMutpd94wRAj/NAJ9yLWog2MQ1ZY//ri6qEU2+Gao/RwCfbMo5 GtcjO5630G0KqQJQzC/VF1g= =A6m8 -----END PGP SIGNATURE----- ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: quest for bld_line (was: Re: Stuff to do) 2006-10-09 16:28 ` Andrey Borzenkov @ 2006-10-11 17:54 ` Andrey Borzenkov 0 siblings, 0 replies; 12+ messages in thread From: Andrey Borzenkov @ 2006-10-11 17:54 UTC (permalink / raw) To: zsh-workers [-- Attachment #1.1: Type: text/plain, Size: 1774 bytes --] On Monday 09 October 2006 20:28, Andrey Borzenkov wrote: > On Monday 09 October 2006 16:00, Peter Stephenson wrote: > > Andrey Borzenkov <arvidjaar@newmail.ru> wrote: > > > The point of patch is to replace exhaustive enumeration of all possible > > > combinations by comparison of patterns. I.e. it checks if two patterns > > > may have something in common - this can be generalized later using > > > different pattern representations. > > > > > > I would be happy if we could just toss away this function. > > > > > > Comments? > > > > I don't know enough about this stage of completion to comment in any > > detail, but it certainly looks like we need to do something about > > it... > > to avoid duplication of efforts - I know now how to do it properly (that > is, to preserve full functionality without relying on current fixed 256 > byte tables), I just have to find some time to write it down. > OK so here is the updated patch; all tests passed; unfortunately I still do not know how to write regression tests that would specifically check this function. I dare to presume that code became legible enough to be understandable now. It checks if it can match next character(s) and then recursively tries to match the rest. The net effect is that instead of answering "how to find all characters matching given pattern" we have to answer "can two patterns match the same character". It is obvious for literal character, ? and equivalence class; it is also obvious for future [:upper:] and [:lower:] patterns; and it is straightforward (albeit not necessarily fast) for [...] set. Question - how does zsh internally represent [...] now? If this is a (set of) bit strings, comparison is trivial. Comments (now attached)? [-- Attachment #1.2: bld_line.diff --] [-- Type: text/x-diff, Size: 11322 bytes --] Index: Src/Zle/compmatch.c =================================================================== RCS file: /cvsroot/zsh/zsh/Src/Zle/compmatch.c,v retrieving revision 1.50 diff -u -p -r1.50 compmatch.c --- Src/Zle/compmatch.c 30 Sep 2006 06:53:15 -0000 1.50 +++ Src/Zle/compmatch.c 11 Oct 2006 17:43:58 -0000 @@ -1087,51 +1087,70 @@ comp_match(char *pfx, char *sfx, char *w return r; } -/* Check if the given pattern matches the given string. * - * p and s are either anchor or line pattern and string; +/* + * Try to match single pair of characters from line and word + * Either of line or word patterns (and characters) may be missing + */ +static int +cm_try_single(Cpattern lp, char lc, Cpattern wp, char wc) +{ + char li = 0, wi = 0; + + if (lp) { + li = lp->tab[STOUC(lc)]; + if (!li) + return 0; + } + if (wp) { + wi = wp->tab[STOUC(wc)]; + if (!wi) + return 0; + } + + /* Equiv is expected to be defined on both sides */ + if (lp && lp->equiv && wp && wp->equiv && li != wi) + return 0; + + return 1; +} + +/* + * Check if the given pattern matches the given string. * + * lp and ls are either anchor or line pattern and string; * wp and ws are word (candidate) pattern and string - * - * If only one pattern is given, we just check if characters match - * If both line and word are given, we check that characters match - * for {...} classes by comparing relative numbers in sequence. - * - * Patterns and strings are always passed in pairs, so it is enough - * to check for non-NULL wp. p should always be present. */ /**/ mod_export int -pattern_match(Cpattern p, char *s, Cpattern wp, char *ws) +pattern_match(Cpattern lp, char *ls, Cpattern wp, char *ws) { - unsigned char c; - unsigned char wc; - - while (p && wp && *s && *ws) { - c = p->tab[*((unsigned char *) s)]; - wc = wp->tab[*((unsigned char *) ws)]; - - if (!c || !wc || c != wc) + while (lp && wp && *ls && *ws) { + if (!cm_try_single(lp, *ls, wp, *ws)) return 0; - s++; + ls++; ws++; - p = p->next; + lp = lp->next; wp = wp->next; } - while (p && *s) { - if (!p->tab[*((unsigned char *) s)]) + while (lp && *ls) { + if (!cm_try_single(lp, *ls, NULL, '\0')) return 0; - p = p->next; - s++; + lp = lp->next; + ls++; } while (wp && *ws) { - if (!wp->tab[*((unsigned char *) ws)]) + if (!cm_try_single(NULL, '\0', wp, *ws)) return 0; wp = wp->next; ws++; } + + /* If any line is shorter than pattern, return falure */ + if ((lp && !*ls) || (wp && !*ws)) + return 0; return 1; } @@ -1214,106 +1233,164 @@ bld_parts(char *str, int len, int plen, return ret; } -/* This builds all the possible line patterns for the pattern pat in the - * buffer line. Initially line is the same as lp, but during recursive - * calls lp is incremented for storing successive characters. Whenever - * a full possible string is build, we test if this line matches the - * string given by wlen and word. +/* + * Compare two different line patterns if they can have some common character + * Insert one of common characters in line we are building (it does not matter + * which one) + * mlp - line pattern which has matched before + * mwp - word pattern which has matched before + * nlp - new line pattern that we currently test against mlp + * nwp - new word pattern that we currently test against mwp + * line - line we build; we insert characters there + */ + +/**/ +static int +pattern_compare(Cpattern mlp, Cpattern mwp, Cpattern nlp, Cpattern nwp, + char *line) +{ + while (nlp) { + int i; + + /* + * test to see if mlp and nlp have something in commons + * nlp cannot be less than mlp (we check pattern length before) + * but word pattern may of course be shorter than line ... + */ + for (i = 0; i < 256; i++) + if (mlp->tab[i] && nlp->tab[i]) { + /* for equiv. class they must also match word pattern */ + if (mlp->equiv) { + if (!mwp || !nwp || (mlp->tab[i] == mwp->tab[i] && + nlp->tab[i] == nwp->tab[i])) + break; + } else + break; + } + if (i < 256) { + /* OK we found character that matches both matchers */ + *line++ = (char)i; + } else { + /* No matching character */ + return 0; + } + mlp = mlp->next; + nlp = nlp->next; + if (mwp) mwp = mwp->next; + if (nwp) nwp = nwp->next; + } + + return 1; +} + +/* This tries to find out, if there is common line that may match two + * words (possible matches or parts thereof). When this function is called, + * it is ensured that `mword' has matched word pattern in `matcher'; + * we try to find a string that both matches line pattern in `matcher' + * and another word `word' * - * wpat contains pattern that matched previously - * lpat contains the pattern for line we build - * mword is a string that matched wpat before - * word is string that we try to match now + * matcher - matcher that `mword' has been matched against + * line - buffer for string we build + * lmatch - length of line built so far + * wmatch - length of word matching the above + * mword - word that has matched word pattern in `matcher' before + * word - is string that we try to match now + * wlen - length of `word' + * sfx - if we should match bacwards * - * The return value is the length of the string matched in the word, it + * The return value is the length of the string matched in the `word', it * is zero if we couldn't build a line that matches the word. */ - /**/ static int -bld_line(Cpattern wpat, Cpattern lpat, char *line, char *lp, +bld_line(Cmatcher matcher, char *line, int lmatch, int wmatch, char *mword, char *word, int wlen, int sfx) { - if (lpat) { - /* Still working on the pattern. */ - - int i, l; - unsigned char c = 0; + static Cpattern *mlpa; + static Cpattern *mwpa; + Cmlist ms; + Cmatcher mp; + Cpattern pat; + char *lp; + int llen = matcher->llen, rl = 0, i, clpos, cwpos; - /* Get the number of the character for a correspondence class - * if it has a corresponding class. */ - if (lpat->equiv) - if (wpat && *mword) { - c = wpat->tab[STOUC(*mword)]; - wpat = wpat->next; - mword++; - } + if (!lmatch) { + /* + * Setup array instead of list; this is required for suffix matchs + * Array size is that of line; we are not interested in patterns + * above this length + */ + mlpa = zalloc(sizeof(Cpattern) * llen); + mwpa = zalloc(sizeof(Cpattern) * llen); + for (i = 0, pat = matcher->line; pat; i++, pat = pat->next) + mlpa[i] = pat; + for (i = 0, pat = matcher->word; pat && i < llen; i++, pat = pat->next) + mwpa[i] = pat; + /* In case mword is smaller than line ... */ + for (i = matcher->wlen; i < llen; i++) + mwpa[i] = NULL; + } - /* Walk through the table in the pattern and try the characters - * that may appear in the current position. */ - for (i = 0; i < 256; i++) - if ((lpat->equiv && c) ? (c == lpat->tab[i]) : lpat->tab[i]) { - *lp = i; - /* We stored the character, now call ourselves to build - * the rest. */ - if ((l = bld_line(wpat, lpat->next, line, lp + 1, - mword, word, wlen, sfx))) - return l; - } + if (sfx) { + lp = line + llen - lmatch; + clpos = llen - lmatch - 1; + cwpos = llen - wmatch - 1; } else { - /* We reached the end, i.e. the line string is fully build, now - * see if it matches the given word. */ - - Cmlist ms; - Cmatcher mp; - int l = lp - line, t, rl = 0, ind, add; - - /* Quick test if the strings are exactly the same. */ - if (l == wlen && !strncmp(line, word, l)) - return l; - - if (sfx) { - line = lp; word += wlen; - ind = -1; add = -1; - } else { - ind = 0; add = 1; + lp = line + lmatch; + clpos = lmatch; + cwpos = wmatch; + } + + /* + * First try if next character in word fits mword directly; + * if yes, try to build the rest of line + * Do it only if word is at least as large as line + */ + if (wmatch + 1 <= llen && + cm_try_single(mlpa[clpos], word[clpos], + mwpa[clpos], mword[clpos])) { + /* last character? */ + if (lmatch + 1 == llen) { + rl = 1; + goto cleanup; } - /* We loop through the whole line string built. */ - while (l && wlen) { - if (word[ind] == line[ind]) { - /* The same character in both strings, skip over. */ - line += add; word += add; - l--; wlen--; rl++; - } else { - t = 0; - for (ms = bmatchers; ms && !t; ms = ms->next) { - mp = ms->matcher; - if (mp && !mp->flags && mp->wlen <= wlen && mp->llen <= l && - pattern_match(mp->line, (sfx ? line - mp->llen : line), - mp->word, (sfx ? word - mp->wlen : word))) { - /* Both the line and the word pattern matched, - * now skip over the matched portions. */ - if (sfx) { - line -= mp->llen; word -= mp->wlen; - } else { - line += mp->llen; word += mp->wlen; - } - l -= mp->llen; wlen -= mp->wlen; rl += mp->wlen; - t = 1; - } - } - if (!t) - /* Didn't match, give up. */ - return 0; + if ((rl = bld_line(matcher, line, lmatch + 1, wmatch + 1, + mword, word, wlen, sfx)) >= 0) { + rl++; + goto cleanup; + } + } + + /* nope; try each matcher in turn and see if we can match with it */ + for (ms = bmatchers; ms; ms = ms->next) { + mp = ms->matcher; + if (mp && !mp->flags && mp->wlen <= wlen - wmatch && mp->llen <= llen - lmatch && + pattern_match(mp->word, (sfx ? word - wmatch - mp->wlen : word + wmatch), + NULL, NULL) && + pattern_compare(mlpa[clpos], mwpa[clpos], mp->line, mp->word, + (sfx ? lp - mp->llen : lp))) { + /* Both the line and the word pattern matched, + * now try to match the rest */ + rl = bld_line(matcher, line, lmatch + mp->llen, wmatch + mp->wlen, + mword, word, wlen, sfx); + if (rl) { + rl += mp->wlen; + goto cleanup; } } - if (!l) - /* Unmatched portion in the line built, return matched length. */ - return rl; } - return 0; + + /* Nothing match; return failure */ + +cleanup: + if (!lmatch) { + zfree(mlpa, sizeof(Cpattern) * matcher->llen); + zfree(mwpa, sizeof(Cpattern) * matcher->wlen); + } + + return rl; } /* This builds a string that may be put on the line that fully matches the @@ -1357,7 +1434,7 @@ join_strs(int la, char *sa, int lb, char } /* Now try to build a string that matches the other * string. */ - if ((bl = bld_line(mp->word, mp->line, line, line, + if ((bl = bld_line(mp, line, 0, 0, *ap, *bp, *blp, 0))) { /* Found one, put it into the return string. */ line[mp->llen] = '\0'; @@ -1560,7 +1637,7 @@ join_sub(Cmdata md, char *str, int len, else mw = nw - (sfx ? mp->wlen : 0); - if ((bl = bld_line(mp->word, mp->line, line, line, + if ((bl = bld_line(mp, line, 0, 0, mw, (t ? nw : ow), (t ? nl : ol), sfx))) { /* Yep, one of the lines matched the other * string. */ [-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2006-10-11 17:55 UTC | newest] Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2006-09-27 12:11 Stuff to do Peter Stephenson 2006-09-27 13:09 ` DervishD 2006-09-29 3:20 ` Bart Schaefer 2006-09-29 8:05 ` DervishD 2006-09-29 16:37 ` Andrey Borzenkov 2006-09-29 17:08 ` Peter Stephenson 2006-09-29 18:08 ` Andrey Borzenkov 2006-09-29 18:08 ` Andrey Borzenkov 2006-10-08 15:38 ` quest for bld_line (was: Re: Stuff to do) Andrey Borzenkov 2006-10-09 12:00 ` Peter Stephenson 2006-10-09 16:28 ` Andrey Borzenkov 2006-10-11 17:54 ` Andrey Borzenkov
Code repositories for project(s) associated with this public inbox https://git.vuxu.org/mirror/zsh/ This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).