From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 19662 invoked from network); 29 Sep 2006 16:37:33 -0000 X-Spam-Checker-Version: SpamAssassin 3.1.5 (2006-08-29) on f.primenet.com.au X-Spam-Level: X-Spam-Status: No, score=-2.5 required=5.0 tests=AWL,BAYES_00,DRUGS_MUSCLE, FORGED_RCVD_HELO autolearn=ham version=3.1.5 Received: from news.dotsrc.org (HELO a.mx.sunsite.dk) (130.225.247.88) by ns1.primenet.com.au with SMTP; 29 Sep 2006 16:37:33 -0000 Received-SPF: none (ns1.primenet.com.au: domain at sunsite.dk does not designate permitted sender hosts) Received: (qmail 68088 invoked from network); 29 Sep 2006 16:37:26 -0000 Received: from sunsite.dk (130.225.247.90) by a.mx.sunsite.dk with SMTP; 29 Sep 2006 16:37:26 -0000 Received: (qmail 10747 invoked by alias); 29 Sep 2006 16:37:22 -0000 Mailing-List: contact zsh-workers-help@sunsite.dk; run by ezmlm Precedence: bulk X-No-Archive: yes X-Seq: 22787 Received: (qmail 10735 invoked from network); 29 Sep 2006 16:37:20 -0000 Received: from news.dotsrc.org (HELO a.mx.sunsite.dk) (130.225.247.88) by sunsite.dk with SMTP; 29 Sep 2006 16:37:20 -0000 Received: (qmail 67723 invoked from network); 29 Sep 2006 16:37:20 -0000 Received: from flock1.newmail.ru (80.68.241.157) by a.mx.sunsite.dk with SMTP; 29 Sep 2006 16:37:19 -0000 Received: (qmail 27307 invoked from network); 29 Sep 2006 16:37:17 -0000 Received: from unknown (HELO cooker.local) (arvidjaar@newmail.ru@85.141.135.179) by smtpd.newmail.ru with SMTP; 29 Sep 2006 16:37:17 -0000 From: Andrey Borzenkov To: zsh-workers@sunsite.dk Subject: Re: Stuff to do Date: Fri, 29 Sep 2006 20:37:11 +0400 User-Agent: KMail/1.9.4 References: <200609271211.k8RCBW5N023914@news01.csr.com> In-Reply-To: <200609271211.k8RCBW5N023914@news01.csr.com> MIME-Version: 1.0 Content-Type: multipart/signed; boundary="nextPart68212220.gGIBeDomvJ"; protocol="application/pgp-signature"; micalg=pgp-sha1 Content-Transfer-Encoding: 7bit Message-Id: <200609292037.17847.arvidjaar@newmail.ru> --nextPart68212220.gGIBeDomvJ Content-Type: multipart/mixed; boundary="Boundary-01=_6uUHFZSlP+ieXZw" Content-Transfer-Encoding: 7bit Content-Disposition: inline --Boundary-01=_6uUHFZSlP+ieXZw Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Content-Disposition: inline On Wednesday 27 September 2006 16:11, Peter Stephenson wrote: > - The matcher specifications in completion don't handle multibyte > characters and are currently written in such a way as to make this > hard (similar to the old suffix character handling). I have been looking at it recently. So far the following issues came up. 1. matcher code assumes character =3D=3D byte and is using 256 bytes array = to=20 build character equivalence classes. What is worse, it is passing this arra= y=20 around between different functions to suppply results of previous matching.= I=20 have here patch (attached) that eliminates external dependency on this arra= y=20 so matcher internals can be more easily changed. This seems to make code a= =20 bit more understandable irrespectively :) OK to commit? 2. Usage of magic array for character classes ([abcd]) can be naturally=20 superceded by using either generic pattern matching or direct comparison.=20 Pattern matching provides for using something like [[:lower:]] and possibly= =20 using matchers etc but potential side effects of extended globbing need=20 review. I do not know what is faster. Is it OK? 3. Equivalence classes ({abcd}=3D{xyzw}) do not scale beyond single byte=20 characters. But if we check usage I believe, it has never been used for=20 anything beyond case-insensitive matching. For this particular usage I=20 suggest using new matcher type: m:LPAT>upper m:LPAT>lower with obvious semantic - character from line is converted to lower or upper = and=20 compared with character from potential match. So m:{a-z}=3D{A-Z} becomes=20 m:?>upper etc. We still can implement {...} for character _set_ but not for character rang= e.=20 So far I do not consider it major problem. 4. The hardest part. Right anchor. For this matcher must match _backward_. = I=20 am not aware of any way to walk backward as long as we assume arbitrary=20 encoding. Options apparently are a) careful modification of code to compute line and patter length and try t= o=20 match from the (llen - plen) point. After all we know that we do not need t= o=20 match more than that. This may be doable, I did not yet try. b) convert this code to use wide characters. Not sure if this is a viable=20 option. c) Use UTF-8 :) It is no joke - I expect nowadays 99% of all systems using= =20 multibyte encoding use UTF-8. So we may concentrate on most commonly used=20 case and think about other encoding later if anyone really needs it. As soo= n=20 as we assume input be in UTF-8 we immediately get enormous advantages =2D it can be easily traversed both backwards and forwards (meta adds some= =20 complexity but not much) =2D length is known in advance, we can save mbrtowc() calls =2D if we know that system is using UCS-4 for wide characters (and I guess = we do=20 not support others at all) converting to/from wide character can be=20 implemented without calling mbrtowc() & Co. at all. =2D (remotely) this allows us to get rid of META which saves quite a bit of= =20 memory allocation overhead. Comments?=20 --Boundary-01=_6uUHFZSlP+ieXZw Content-Type: text/x-diff; charset="utf-8"; name="pattern_match.diff" Content-Transfer-Encoding: quoted-printable Content-Disposition: attachment; filename="pattern_match.diff" Index: Src/Zle/compmatch.c =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D RCS file: /cvsroot/zsh/zsh/Src/Zle/compmatch.c,v retrieving revision 1.48 diff -u -p -r1.48 compmatch.c =2D-- Src/Zle/compmatch.c 12 Jan 2006 00:51:51 -0000 1.48 +++ Src/Zle/compmatch.c 29 Sep 2006 16:00:31 -0000 @@ -461,7 +461,6 @@ match_str(char *l, char *w, Brinfo *bpp, { int ll =3D strlen(l), lw =3D strlen(w), oll =3D ll, olw =3D lw, exact = =3D 0, wexact =3D 0; int il =3D 0, iw =3D 0, t, ind, add, he =3D 0, bpc, obc =3D bc, bslash; =2D VARARR(unsigned char, ea, (ll > lw ? ll : lw) + 1); char *ow; Cmlist ms; Cmatcher mp, lm =3D NULL; @@ -795,9 +794,7 @@ match_str(char *l, char *w, Brinfo *bpp, (il || iw))); } /* Now try to match the line and word patterns. */ =2D if (!t || =2D !pattern_match(mp->line, tl, NULL, ea) || =2D !pattern_match(mp->word, tw, ea, NULL)) + if (!t || !pattern_match(mp->line, tl, mp->word, tw)) continue; =20 /* Probably add the matched strings. */ @@ -1091,44 +1088,51 @@ comp_match(char *pfx, char *sfx, char *w } =20 /* Check if the given pattern matches the given string. * =2D * `in' and `out' are used for {...} classes. In `out' we store the * =2D * character number that was matched. In the word pattern this is * =2D * given in `in' so that we can easily test if we found the * =2D * corresponding character. */ + * p and s are either anchor or line pattern and string; + * wp and ws are word (candidate) pattern and string + * + * If only one pattern is given, we just check if characters match + * If both line and word are given, we check that characters match + * for {...} classes by comparing relative numbers in sequence. + * + * Patterns and strings are always passed in pairs, so it is enough + * to check for non-NULL wp. p should always be present. + */ =20 /**/ mod_export int =2Dpattern_match(Cpattern p, char *s, unsigned char *in, unsigned char *out) +pattern_match(Cpattern p, char *s, Cpattern wp, char *ws) { unsigned char c; + unsigned char wc; =20 =2D while (p) { =2D c =3D *((unsigned char *) s); + while (p && wp && *s && *ws) { + c =3D p->tab[*((unsigned char *) s)]; + wc =3D wp->tab[*((unsigned char *) ws)]; =20 =2D if (out) =2D *out =3D 0; =2D =2D if (p->equiv) { =2D if (in) { =2D c =3D p->tab[c]; =2D if ((*in && *in !=3D c) || (!*in && !c)) =2D return 0; =2D } else if (out) { =2D if (!(*out =3D p->tab[c])) =2D return 0; =2D } else if (!p->tab[c]) =2D return 0; =2D =2D if (in && *in) =2D in++; =2D if (out) =2D out++; =2D } else if (!p->tab[c]) + if (!c || !wc || c !=3D wc) return 0; =20 s++; + ws++; p =3D p->next; + wp =3D wp->next; + } + + while (p && *s) { + if (!p->tab[*((unsigned char *) s)]) + return 0; + p =3D p->next; + s++; } + + while (wp && *ws) { + if (!wp->tab[*((unsigned char *) ws)]) + return 0; + wp =3D wp->next; + ws++; + } + return 1; } =20 @@ -1214,38 +1218,48 @@ bld_parts(char *str, int len, int plen,=20 * buffer line. Initially line is the same as lp, but during recursive * calls lp is incremented for storing successive characters. Whenever * a full possible string is build, we test if this line matches the =2D * string given by wlen and word. The in argument contains the characters =2D * to use for the correspondence classes, it was filled by a call to=20 =2D * pattern_match() in the calling function. + * string given by wlen and word. + * + * wpat contains pattern that matched previously + * lpat contains the pattern for line we build + * mword is a string that matched wpat before + * word is string that we try to match now + * * The return value is the length of the string matched in the word, it =2D * is zero if we couldn't build a line that matches the word. */ + * is zero if we couldn't build a line that matches the word. + */ + =20 /**/ static int =2Dbld_line(Cpattern pat, char *line, char *lp, =2D char *word, int wlen, unsigned char *in, int sfx) +bld_line(Cpattern wpat, Cpattern lpat, char *line, char *lp, + char *mword, char *word, int wlen, int sfx) { =2D if (pat) { + if (lpat) { /* Still working on the pattern. */ =20 int i, l; unsigned char c =3D 0; =20 /* Get the number of the character for a correspondence class =2D * if it has a correxponding class. */ =2D if (pat->equiv) =2D if ((c =3D *in)) =2D in++; + * if it has a corresponding class. */ + if (lpat->equiv) + if (wpat && *mword) { + c =3D wpat->tab[STOUC(*mword)]; + wpat =3D wpat->next; + mword++; + } + =20 /* Walk through the table in the pattern and try the characters * that may appear in the current position. */ for (i =3D 0; i < 256; i++) =2D if ((pat->equiv && c) ? (c =3D=3D pat->tab[i]) : pat->tab[i]) { + if ((lpat->equiv && c) ? (c =3D=3D lpat->tab[i]) : lpat->tab[i]) { *lp =3D i; /* We stored the character, now call ourselves to build * the rest. */ =2D if ((l =3D bld_line(pat->next, line, lp + 1, word, wlen, =2D in, sfx))) + if ((l =3D bld_line(wpat, lpat->next, line, lp + 1, + mword, word, wlen, sfx))) return l; } } else { @@ -1255,7 +1269,6 @@ bld_line(Cpattern pat, char *line, char=20 Cmlist ms; Cmatcher mp; int l =3D lp - line, t, rl =3D 0, ind, add; =2D VARARR(unsigned char, ea, l + 1); =20 /* Quick test if the strings are exactly the same. */ if (l =3D=3D wlen && !strncmp(line, word, l)) @@ -1279,9 +1292,7 @@ bld_line(Cpattern pat, char *line, char=20 mp =3D ms->matcher; if (mp && !mp->flags && mp->wlen <=3D wlen && mp->llen <=3D l && pattern_match(mp->line, (sfx ? line - mp->llen : line), =2D NULL, ea) && =2D pattern_match(mp->word, (sfx ? word - mp->wlen : word), =2D ea, NULL)) { + mp->word, (sfx ? word - mp->wlen : word))) { /* Both the line and the word pattern matched, * now skip over the matched portions. */ if (sfx) { @@ -1316,7 +1327,6 @@ join_strs(int la, char *sa, int lb, char static char *rs =3D NULL; static int rl =3D 0; =20 =2D VARARR(unsigned char, ea, (la > lb ? la : lb) + 1); Cmlist ms; Cmatcher mp; int t, bl, rr =3D rl; @@ -1331,8 +1341,7 @@ join_strs(int la, char *sa, int lb, char mp->wlen <=3D la && mp->wlen <=3D lb) { /* The pattern has no anchors and the word * pattern fits, try it. */ =2D if ((t =3D pattern_match(mp->word, sa, NULL, ea)) || =2D pattern_match(mp->word, sb, NULL, ea)) { + if ((t =3D pattern_match(mp->word, sa, mp->word, sb))) { /* It matched one of the strings, t says which one. */ VARARR(char, line, mp->llen + 1); char **ap, **bp; @@ -1347,8 +1356,8 @@ join_strs(int la, char *sa, int lb, char } /* Now try to build a string that matches the other * string. */ =2D if ((bl =3D bld_line(mp->line, line, line, =2D *bp, *blp, ea, 0))) { + if ((bl =3D bld_line(mp->word, mp->line, line, line, + *ap, *bp, *blp, 0))) { /* Found one, put it into the return string. */ line[mp->llen] =3D '\0'; if (rr <=3D mp->llen) { @@ -1503,7 +1512,6 @@ join_sub(Cmdata md, char *str, int len,=20 int ol =3D len, nl =3D md->len; Cmlist ms; Cmatcher mp; =2D VARARR(unsigned char, ea, (ol > nl ? ol : nl) + 1); int t; =20 if (sfx) { @@ -1519,9 +1527,7 @@ join_sub(Cmdata md, char *str, int len,=20 * new one. */ if (mp->llen <=3D ol && mp->wlen <=3D nl && pattern_match(mp->line, ow - (sfx ? mp->llen : 0), =2D NULL, ea) && =2D pattern_match(mp->word, nw - (sfx ? mp->wlen : 0), =2D ea, NULL)) { + mp->word, nw - (sfx ? mp->wlen : 0))) { /* It did, update the contents of the cmdata struct * and return a cline for the matched part. */ if (sfx) @@ -1539,17 +1545,22 @@ join_sub(Cmdata md, char *str, int len,=20 * pattern matches one of the strings. */ if (join && mp->wlen <=3D ol && mp->wlen <=3D nl && ((t =3D pattern_match(mp->word, ow - (sfx ? mp->wlen : 0), =2D NULL, ea)) || + NULL, NULL)) || pattern_match(mp->word, nw - (sfx ? mp->wlen : 0), =2D NULL, ea))) { + NULL, NULL))) { VARARR(char, line, mp->llen + 1); int bl; + char *mw; =20 /* Then build all the possible lines and see * if one of them matches the other string. */ =2D if ((bl =3D bld_line(mp->line, line, line, =2D (t ? nw : ow), (t ? nl : ol), =2D ea, sfx))) { + if (t) + mw =3D ow - (sfx ? mp->wlen : 0); + else + mw =3D nw - (sfx ? mp->wlen : 0); + + if ((bl =3D bld_line(mp->word, mp->line, line, line, + mw, (t ? nw : ow), (t ? nl : ol), sfx))) { /* Yep, one of the lines matched the other * string. */ line[mp->llen] =3D '\0'; --Boundary-01=_6uUHFZSlP+ieXZw-- --nextPart68212220.gGIBeDomvJ Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.5 (GNU/Linux) iD8DBQBFHUu9R6LMutpd94wRAsj2AJ9e4hUcsFjPoViUYHTyHa/Giu32aQCgs+po wicAzVnGIsX8v0NNXmTjdyk= =9OU2 -----END PGP SIGNATURE----- --nextPart68212220.gGIBeDomvJ--