From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 22983 invoked from network); 11 Oct 2006 17:55:12 -0000 X-Spam-Checker-Version: SpamAssassin 3.1.6 (2006-10-03) on f.primenet.com.au X-Spam-Level: X-Spam-Status: No, score=-2.5 required=5.0 tests=AWL,BAYES_00, FORGED_RCVD_HELO autolearn=ham version=3.1.6 Received: from news.dotsrc.org (HELO a.mx.sunsite.dk) (130.225.247.88) by ns1.primenet.com.au with SMTP; 11 Oct 2006 17:55:12 -0000 Received-SPF: none (ns1.primenet.com.au: domain at sunsite.dk does not designate permitted sender hosts) Received: (qmail 43474 invoked from network); 11 Oct 2006 17:54:44 -0000 Received: from sunsite.dk (130.225.247.90) by a.mx.sunsite.dk with SMTP; 11 Oct 2006 17:54:44 -0000 Received: (qmail 19997 invoked by alias); 11 Oct 2006 17:54:38 -0000 Mailing-List: contact zsh-workers-help@sunsite.dk; run by ezmlm Precedence: bulk X-No-Archive: yes X-Seq: 22867 Received: (qmail 19987 invoked from network); 11 Oct 2006 17:54:37 -0000 Received: from news.dotsrc.org (HELO a.mx.sunsite.dk) (130.225.247.88) by sunsite.dk with SMTP; 11 Oct 2006 17:54:37 -0000 Received: (qmail 42801 invoked from network); 11 Oct 2006 17:54:37 -0000 Received: from flock1.newmail.ru (80.68.241.157) by a.mx.sunsite.dk with SMTP; 11 Oct 2006 17:54:35 -0000 Received: (qmail 2708 invoked from network); 11 Oct 2006 17:54:34 -0000 Received: from unknown (HELO cooker.local) (arvidjaar@newmail.ru@83.237.228.122) by smtpd.newmail.ru with SMTP; 11 Oct 2006 17:54:34 -0000 From: Andrey Borzenkov To: zsh-workers@sunsite.dk Subject: Re: quest for bld_line (was: Re: Stuff to do) Date: Wed, 11 Oct 2006 21:54:27 +0400 User-Agent: KMail/1.9.4 References: <200609271211.k8RCBW5N023914@news01.csr.com> <20061009130024.76925819.pws@csr.com> <200610092028.50896.arvidjaar@newmail.ru> In-Reply-To: <200610092028.50896.arvidjaar@newmail.ru> MIME-Version: 1.0 Content-Type: multipart/signed; boundary="nextPart3673570.NB86IHfcFN"; protocol="application/pgp-signature"; micalg=pgp-sha1 Content-Transfer-Encoding: 7bit Message-Id: <200610112154.33016.arvidjaar@newmail.ru> --nextPart3673570.NB86IHfcFN Content-Type: multipart/mixed; boundary="Boundary-01=_W/SLF0hxhsULHUz" Content-Transfer-Encoding: 7bit Content-Disposition: inline --Boundary-01=_W/SLF0hxhsULHUz Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable Content-Disposition: inline On Monday 09 October 2006 20:28, Andrey Borzenkov wrote: > On Monday 09 October 2006 16:00, Peter Stephenson wrote: > > Andrey Borzenkov wrote: > > > The point of patch is to replace exhaustive enumeration of all possib= le > > > combinations by comparison of patterns. I.e. it checks if two patterns > > > may have something in common - this can be generalized later using > > > different pattern representations. > > > > > > I would be happy if we could just toss away this function. > > > > > > Comments? > > > > I don't know enough about this stage of completion to comment in any > > detail, but it certainly looks like we need to do something about > > it... > > to avoid duplication of efforts - I know now how to do it properly (that > is, to preserve full functionality without relying on current fixed 256 > byte tables), I just have to find some time to write it down. > OK so here is the updated patch; all tests passed; unfortunately I still do= =20 not know how to write regression tests that would specifically check this=20 function. I dare to presume that code became legible enough to be understandable now.= It=20 checks if it can match next character(s) and then recursively tries to matc= h=20 the rest. The net effect is that instead of answering "how to find all=20 characters matching given pattern" we have to answer "can two patterns matc= h=20 the same character". It is obvious for literal character, ? and equivalence= =20 class; it is also obvious for future [:upper:] and [:lower:] patterns; and = it=20 is straightforward (albeit not necessarily fast) for [...] set. Question - how does zsh internally represent [...] now? If this is a (set o= f)=20 bit strings, comparison is trivial. Comments (now attached)? --Boundary-01=_W/SLF0hxhsULHUz Content-Type: text/x-diff; charset="iso-8859-1"; name="bld_line.diff" Content-Transfer-Encoding: quoted-printable Content-Disposition: attachment; filename="bld_line.diff" Index: Src/Zle/compmatch.c =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D RCS file: /cvsroot/zsh/zsh/Src/Zle/compmatch.c,v retrieving revision 1.50 diff -u -p -r1.50 compmatch.c =2D-- Src/Zle/compmatch.c 30 Sep 2006 06:53:15 -0000 1.50 +++ Src/Zle/compmatch.c 11 Oct 2006 17:43:58 -0000 @@ -1087,51 +1087,70 @@ comp_match(char *pfx, char *sfx, char *w return r; } =20 =2D/* Check if the given pattern matches the given string. * =2D * p and s are either anchor or line pattern and string; +/* + * Try to match single pair of characters from line and word + * Either of line or word patterns (and characters) may be missing + */ +static int +cm_try_single(Cpattern lp, char lc, Cpattern wp, char wc) +{ + char li =3D 0, wi =3D 0; + + if (lp) { + li =3D lp->tab[STOUC(lc)]; + if (!li) + return 0; + } + if (wp) { + wi =3D wp->tab[STOUC(wc)]; + if (!wi) + return 0; + } + + /* Equiv is expected to be defined on both sides */ + if (lp && lp->equiv && wp && wp->equiv && li !=3D wi) + return 0; + + return 1; +} + +/* + * Check if the given pattern matches the given string. * + * lp and ls are either anchor or line pattern and string; * wp and ws are word (candidate) pattern and string =2D * =2D * If only one pattern is given, we just check if characters match =2D * If both line and word are given, we check that characters match =2D * for {...} classes by comparing relative numbers in sequence. =2D * =2D * Patterns and strings are always passed in pairs, so it is enough =2D * to check for non-NULL wp. p should always be present. */ =20 /**/ mod_export int =2Dpattern_match(Cpattern p, char *s, Cpattern wp, char *ws) +pattern_match(Cpattern lp, char *ls, Cpattern wp, char *ws) { =2D unsigned char c; =2D unsigned char wc; =2D =2D while (p && wp && *s && *ws) { =2D c =3D p->tab[*((unsigned char *) s)]; =2D wc =3D wp->tab[*((unsigned char *) ws)]; =2D =2D if (!c || !wc || c !=3D wc) + while (lp && wp && *ls && *ws) { + if (!cm_try_single(lp, *ls, wp, *ws)) return 0; =20 =2D s++; + ls++; ws++; =2D p =3D p->next; + lp =3D lp->next; wp =3D wp->next; } =20 =2D while (p && *s) { =2D if (!p->tab[*((unsigned char *) s)]) + while (lp && *ls) { + if (!cm_try_single(lp, *ls, NULL, '\0')) return 0; =2D p =3D p->next; =2D s++; + lp =3D lp->next; + ls++; } =20 while (wp && *ws) { =2D if (!wp->tab[*((unsigned char *) ws)]) + if (!cm_try_single(NULL, '\0', wp, *ws)) return 0; wp =3D wp->next; ws++; } + =20 + /* If any line is shorter than pattern, return falure */ + if ((lp && !*ls) || (wp && !*ws)) + return 0; =20 return 1; } @@ -1214,106 +1233,164 @@ bld_parts(char *str, int len, int plen,=20 return ret; } =20 =2D/* This builds all the possible line patterns for the pattern pat in the =2D * buffer line. Initially line is the same as lp, but during recursive =2D * calls lp is incremented for storing successive characters. Whenever =2D * a full possible string is build, we test if this line matches the =2D * string given by wlen and word. +/* + * Compare two different line patterns if they can have some common charac= ter + * Insert one of common characters in line we are building (it does not ma= tter + * which one) + * mlp - line pattern which has matched before + * mwp - word pattern which has matched before + * nlp - new line pattern that we currently test against mlp + * nwp - new word pattern that we currently test against mwp + * line - line we build; we insert characters there + */ + +/**/ +static int +pattern_compare(Cpattern mlp, Cpattern mwp, Cpattern nlp, Cpattern nwp, + char *line) +{ + while (nlp) { + int i; + + /* + * test to see if mlp and nlp have something in commons + * nlp cannot be less than mlp (we check pattern length before) + * but word pattern may of course be shorter than line ... + */ + for (i =3D 0; i < 256; i++) + if (mlp->tab[i] && nlp->tab[i]) { + /* for equiv. class they must also match word pattern */ + if (mlp->equiv) { + if (!mwp || !nwp || (mlp->tab[i] =3D=3D mwp->tab[i] && + nlp->tab[i] =3D=3D nwp->tab[i])) + break; + } else + break; + } + if (i < 256) { + /* OK we found character that matches both matchers */ + *line++ =3D (char)i; + } else { + /* No matching character */ + return 0; + } + mlp =3D mlp->next; + nlp =3D nlp->next; + if (mwp) mwp =3D mwp->next; + if (nwp) nwp =3D nwp->next; + } + + return 1; +} + +/* This tries to find out, if there is common line that may match two + * words (possible matches or parts thereof). When this function is called, + * it is ensured that `mword' has matched word pattern in `matcher'; + * we try to find a string that both matches line pattern in `matcher' + * and another word `word' * =2D * wpat contains pattern that matched previously =2D * lpat contains the pattern for line we build =2D * mword is a string that matched wpat before =2D * word is string that we try to match now + * matcher - matcher that `mword' has been matched against + * line - buffer for string we build + * lmatch - length of line built so far + * wmatch - length of word matching the above + * mword - word that has matched word pattern in `matcher' before + * word - is string that we try to match now + * wlen - length of `word' + * sfx - if we should match bacwards * =2D * The return value is the length of the string matched in the word, it + * The return value is the length of the string matched in the `word', it * is zero if we couldn't build a line that matches the word. */ =20 =2D /**/ static int =2Dbld_line(Cpattern wpat, Cpattern lpat, char *line, char *lp, +bld_line(Cmatcher matcher, char *line, int lmatch, int wmatch, char *mword, char *word, int wlen, int sfx) { =2D if (lpat) { =2D /* Still working on the pattern. */ =2D =2D int i, l; =2D unsigned char c =3D 0; + static Cpattern *mlpa; + static Cpattern *mwpa; + Cmlist ms; + Cmatcher mp; + Cpattern pat; + char *lp; + int llen =3D matcher->llen, rl =3D 0, i, clpos, cwpos; =20 =2D /* Get the number of the character for a correspondence class =2D * if it has a corresponding class. */ =2D if (lpat->equiv) =2D if (wpat && *mword) { =2D c =3D wpat->tab[STOUC(*mword)]; =2D wpat =3D wpat->next; =2D mword++; =2D } + if (!lmatch) { + /* + * Setup array instead of list; this is required for suffix matchs + * Array size is that of line; we are not interested in patterns + * above this length + */ + mlpa =3D zalloc(sizeof(Cpattern) * llen); + mwpa =3D zalloc(sizeof(Cpattern) * llen); =20 + for (i =3D 0, pat =3D matcher->line; pat; i++, pat =3D pat->next) + mlpa[i] =3D pat; + for (i =3D 0, pat =3D matcher->word; pat && i < llen; i++, pat =3D pat->n= ext) + mwpa[i] =3D pat; + /* In case mword is smaller than line ... */ + for (i =3D matcher->wlen; i < llen; i++) + mwpa[i] =3D NULL; + } =20 =2D /* Walk through the table in the pattern and try the characters =2D * that may appear in the current position. */ =2D for (i =3D 0; i < 256; i++) =2D if ((lpat->equiv && c) ? (c =3D=3D lpat->tab[i]) : lpat->tab[i]) { =2D *lp =3D i; =2D /* We stored the character, now call ourselves to build =2D * the rest. */ =2D if ((l =3D bld_line(wpat, lpat->next, line, lp + 1, =2D mword, word, wlen, sfx))) =2D return l; =2D } + if (sfx) { + lp =3D line + llen - lmatch; + clpos =3D llen - lmatch - 1; + cwpos =3D llen - wmatch - 1; } else { =2D /* We reached the end, i.e. the line string is fully build, now =2D * see if it matches the given word. */ =2D =2D Cmlist ms; =2D Cmatcher mp; =2D int l =3D lp - line, t, rl =3D 0, ind, add; =2D =2D /* Quick test if the strings are exactly the same. */ =2D if (l =3D=3D wlen && !strncmp(line, word, l)) =2D return l; =2D =2D if (sfx) { =2D line =3D lp; word +=3D wlen; =2D ind =3D -1; add =3D -1; =2D } else { =2D ind =3D 0; add =3D 1; + lp =3D line + lmatch; + clpos =3D lmatch; + cwpos =3D wmatch; + } + + /* + * First try if next character in word fits mword directly; + * if yes, try to build the rest of line + * Do it only if word is at least as large as line + */ + if (wmatch + 1 <=3D llen && + cm_try_single(mlpa[clpos], word[clpos], + mwpa[clpos], mword[clpos])) { + /* last character? */ + if (lmatch + 1 =3D=3D llen) { + rl =3D 1; + goto cleanup; } =2D /* We loop through the whole line string built. */ =2D while (l && wlen) { =2D if (word[ind] =3D=3D line[ind]) { =2D /* The same character in both strings, skip over. */ =2D line +=3D add; word +=3D add; =2D l--; wlen--; rl++; =2D } else { =2D t =3D 0; =2D for (ms =3D bmatchers; ms && !t; ms =3D ms->next) { =2D mp =3D ms->matcher; =2D if (mp && !mp->flags && mp->wlen <=3D wlen && mp->llen <=3D l && =2D pattern_match(mp->line, (sfx ? line - mp->llen : line), =2D mp->word, (sfx ? word - mp->wlen : word))) { =2D /* Both the line and the word pattern matched, =2D * now skip over the matched portions. */ =2D if (sfx) { =2D line -=3D mp->llen; word -=3D mp->wlen; =2D } else { =2D line +=3D mp->llen; word +=3D mp->wlen; =2D } =2D l -=3D mp->llen; wlen -=3D mp->wlen; rl +=3D mp->wlen; =2D t =3D 1; =2D } =2D } =2D if (!t) =2D /* Didn't match, give up. */ =2D return 0; + if ((rl =3D bld_line(matcher, line, lmatch + 1, wmatch + 1, + mword, word, wlen, sfx)) >=3D 0) { + rl++; + goto cleanup; + } + } + + /* nope; try each matcher in turn and see if we can match with it */ + for (ms =3D bmatchers; ms; ms =3D ms->next) { + mp =3D ms->matcher; + if (mp && !mp->flags && mp->wlen <=3D wlen - wmatch && mp->llen <=3D llen= - lmatch && + pattern_match(mp->word, (sfx ? word - wmatch - mp->wlen : word + wmat= ch), + NULL, NULL) && + pattern_compare(mlpa[clpos], mwpa[clpos], mp->line, mp->word, + (sfx ? lp - mp->llen : lp))) { + /* Both the line and the word pattern matched, + * now try to match the rest */ + rl =3D bld_line(matcher, line, lmatch + mp->llen, wmatch + mp->wlen, + mword, word, wlen, sfx); + if (rl) { + rl +=3D mp->wlen; + goto cleanup; } } =2D if (!l) =2D /* Unmatched portion in the line built, return matched length. */ =2D return rl; } =2D return 0; + + /* Nothing match; return failure */ + +cleanup: + if (!lmatch) { + zfree(mlpa, sizeof(Cpattern) * matcher->llen); + zfree(mwpa, sizeof(Cpattern) * matcher->wlen); + } + + return rl; } =20 /* This builds a string that may be put on the line that fully matches the @@ -1357,7 +1434,7 @@ join_strs(int la, char *sa, int lb, char } /* Now try to build a string that matches the other * string. */ =2D if ((bl =3D bld_line(mp->word, mp->line, line, line, + if ((bl =3D bld_line(mp, line, 0, 0, *ap, *bp, *blp, 0))) { /* Found one, put it into the return string. */ line[mp->llen] =3D '\0'; @@ -1560,7 +1637,7 @@ join_sub(Cmdata md, char *str, int len,=20 else mw =3D nw - (sfx ? mp->wlen : 0); =20 =2D if ((bl =3D bld_line(mp->word, mp->line, line, line, + if ((bl =3D bld_line(mp, line, 0, 0, mw, (t ? nw : ow), (t ? nl : ol), sfx))) { /* Yep, one of the lines matched the other * string. */ --Boundary-01=_W/SLF0hxhsULHUz-- --nextPart3673570.NB86IHfcFN Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.5 (GNU/Linux) iD8DBQBFLS/YR6LMutpd94wRAjpjAJ9hbWFk+dUVOPBuBekGRrMWOqVkkwCggQCY 5q1bm0W1ivaNbe0mai4tHTE= =nUM+ -----END PGP SIGNATURE----- --nextPart3673570.NB86IHfcFN--