From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <zsh-workers-return-22787-mason-zsh=primenet.com.au@sunsite.dk>
Received: (qmail 19662 invoked from network); 29 Sep 2006 16:37:33 -0000
X-Spam-Checker-Version: SpamAssassin 3.1.5 (2006-08-29) on f.primenet.com.au
X-Spam-Level: 
X-Spam-Status: No, score=-2.5 required=5.0 tests=AWL,BAYES_00,DRUGS_MUSCLE,
	FORGED_RCVD_HELO autolearn=ham version=3.1.5
Received: from news.dotsrc.org (HELO a.mx.sunsite.dk) (130.225.247.88)
  by ns1.primenet.com.au with SMTP; 29 Sep 2006 16:37:33 -0000
Received-SPF: none (ns1.primenet.com.au: domain at sunsite.dk does not designate permitted sender hosts)
Received: (qmail 68088 invoked from network); 29 Sep 2006 16:37:26 -0000
Received: from sunsite.dk (130.225.247.90)
  by a.mx.sunsite.dk with SMTP; 29 Sep 2006 16:37:26 -0000
Received: (qmail 10747 invoked by alias); 29 Sep 2006 16:37:22 -0000
Mailing-List: contact zsh-workers-help@sunsite.dk; run by ezmlm
Precedence: bulk
X-No-Archive: yes
X-Seq: 22787
Received: (qmail 10735 invoked from network); 29 Sep 2006 16:37:20 -0000
Received: from news.dotsrc.org (HELO a.mx.sunsite.dk) (130.225.247.88)
  by sunsite.dk with SMTP; 29 Sep 2006 16:37:20 -0000
Received: (qmail 67723 invoked from network); 29 Sep 2006 16:37:20 -0000
Received: from flock1.newmail.ru (80.68.241.157)
  by a.mx.sunsite.dk with SMTP; 29 Sep 2006 16:37:19 -0000
Received: (qmail 27307 invoked from network); 29 Sep 2006 16:37:17 -0000
Received: from unknown (HELO cooker.local) (arvidjaar@newmail.ru@85.141.135.179)
  by smtpd.newmail.ru with SMTP; 29 Sep 2006 16:37:17 -0000
From: Andrey Borzenkov <arvidjaar@newmail.ru>
To: zsh-workers@sunsite.dk
Subject: Re: Stuff to do
Date: Fri, 29 Sep 2006 20:37:11 +0400
User-Agent: KMail/1.9.4
References: <200609271211.k8RCBW5N023914@news01.csr.com>
In-Reply-To: <200609271211.k8RCBW5N023914@news01.csr.com>
MIME-Version: 1.0
Content-Type: multipart/signed;
  boundary="nextPart68212220.gGIBeDomvJ";
  protocol="application/pgp-signature";
  micalg=pgp-sha1
Content-Transfer-Encoding: 7bit
Message-Id: <200609292037.17847.arvidjaar@newmail.ru>

--nextPart68212220.gGIBeDomvJ
Content-Type: multipart/mixed;
  boundary="Boundary-01=_6uUHFZSlP+ieXZw"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

--Boundary-01=_6uUHFZSlP+ieXZw
Content-Type: text/plain;
  charset="utf-8"
Content-Transfer-Encoding: quoted-printable
Content-Disposition: inline

On Wednesday 27 September 2006 16:11, Peter Stephenson wrote:
> - The matcher specifications in completion don't handle multibyte
> characters and are currently written in such a way as to make this
> hard (similar to the old suffix character handling).

I have been looking at it recently. So far the following issues came up.

1. matcher code assumes character =3D=3D byte and is using 256 bytes array =
to=20
build character equivalence classes. What is worse, it is passing this arra=
y=20
around between different functions to suppply results of previous matching.=
 I=20
have here patch (attached) that eliminates external dependency on this arra=
y=20
so matcher internals can be more easily changed. This seems to make code a=
=20
bit more understandable irrespectively :) OK to commit?

2. Usage of magic array for character classes ([abcd]) can be naturally=20
superceded by using either generic pattern matching or direct comparison.=20
Pattern matching provides for using something like [[:lower:]] and possibly=
=20
using matchers etc but potential side effects of extended globbing need=20
review. I do not know what is faster. Is it OK?

3. Equivalence classes ({abcd}=3D{xyzw}) do not scale beyond single byte=20
characters. But if we check usage I believe, it has never been used for=20
anything beyond case-insensitive matching. For this particular usage I=20
suggest using new matcher type:

m:LPAT>upper
m:LPAT>lower

with obvious semantic - character from line is converted to lower or upper =
and=20
compared with character from potential match. So m:{a-z}=3D{A-Z} becomes=20
m:?>upper etc.

We still can implement {...} for character _set_ but not for character rang=
e.=20
So far I do not consider it major problem.

4. The hardest part. Right anchor. For this matcher must match _backward_. =
I=20
am not aware of any way to walk backward as long as we assume arbitrary=20
encoding. Options apparently are

a) careful modification of code to compute line and patter length and try t=
o=20
match from the (llen - plen) point. After all we know that we do not need t=
o=20
match more than that. This may be doable, I did not yet try.

b) convert this code to use wide characters. Not sure if this is a viable=20
option.

c) Use UTF-8 :) It is no joke - I expect nowadays 99% of all systems using=
=20
multibyte encoding use UTF-8. So we may concentrate on most commonly used=20
case and think about other encoding later if anyone really needs it. As soo=
n=20
as we assume input be in UTF-8 we immediately get enormous advantages

=2D it can be easily traversed both backwards and forwards (meta adds some=
=20
complexity but not much)
=2D length is known in advance, we can save mbrtowc() calls
=2D if we know that system is using UCS-4 for wide characters (and I guess =
we do=20
not support others at all) converting to/from wide character can be=20
implemented without calling mbrtowc() & Co. at all.
=2D (remotely) this allows us to get rid of META which saves quite a bit of=
=20
memory allocation overhead.

Comments?=20

--Boundary-01=_6uUHFZSlP+ieXZw
Content-Type: text/x-diff;
  charset="utf-8";
  name="pattern_match.diff"
Content-Transfer-Encoding: quoted-printable
Content-Disposition: attachment;
	filename="pattern_match.diff"

Index: Src/Zle/compmatch.c
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
RCS file: /cvsroot/zsh/zsh/Src/Zle/compmatch.c,v
retrieving revision 1.48
diff -u -p -r1.48 compmatch.c
=2D-- Src/Zle/compmatch.c	12 Jan 2006 00:51:51 -0000	1.48
+++ Src/Zle/compmatch.c	29 Sep 2006 16:00:31 -0000
@@ -461,7 +461,6 @@ match_str(char *l, char *w, Brinfo *bpp,
 {
     int ll =3D strlen(l), lw =3D strlen(w), oll =3D ll, olw =3D lw, exact =
=3D 0, wexact =3D 0;
     int il =3D 0, iw =3D 0, t, ind, add, he =3D 0, bpc, obc =3D bc, bslash;
=2D    VARARR(unsigned char, ea, (ll > lw ? ll : lw) + 1);
     char *ow;
     Cmlist ms;
     Cmatcher mp, lm =3D NULL;
@@ -795,9 +794,7 @@ match_str(char *l, char *w, Brinfo *bpp,
 					  (il || iw)));
 		    }
 		    /* Now try to match the line and word patterns. */
=2D		    if (!t ||
=2D			!pattern_match(mp->line, tl, NULL, ea) ||
=2D			!pattern_match(mp->word, tw, ea, NULL))
+		    if (!t || !pattern_match(mp->line, tl, mp->word, tw))
 			continue;
=20
 		    /* Probably add the matched strings. */
@@ -1091,44 +1088,51 @@ comp_match(char *pfx, char *sfx, char *w
 }
=20
 /* Check if the given pattern matches the given string.             *
=2D * `in' and `out' are used for {...} classes. In `out' we store the *
=2D * character number that was matched. In the word pattern this is   *
=2D * given in `in' so that we can easily test if we found the         *
=2D * corresponding character. */
+ *  p and  s are either anchor or line pattern and string;
+ * wp and ws are word (candidate) pattern and string
+ *
+ * If only one pattern is given, we just check if characters match
+ * If both line and word are given, we check that characters match
+ * for {...} classes by comparing relative numbers in sequence.
+ *
+ * Patterns and strings are always passed in pairs, so it is enough
+ * to check for non-NULL wp. p should always be present.
+ */
=20
 /**/
 mod_export int
=2Dpattern_match(Cpattern p, char *s, unsigned char *in, unsigned char *out)
+pattern_match(Cpattern p, char *s, Cpattern wp, char *ws)
 {
     unsigned char c;
+    unsigned char wc;
=20
=2D    while (p) {
=2D	c =3D *((unsigned char *) s);
+    while (p && wp && *s && *ws) {
+	c =3D p->tab[*((unsigned char *) s)];
+	wc =3D wp->tab[*((unsigned char *) ws)];
=20
=2D	if (out)
=2D	    *out =3D 0;
=2D
=2D	if (p->equiv) {
=2D	    if (in) {
=2D		c =3D p->tab[c];
=2D		if ((*in && *in !=3D c) || (!*in && !c))
=2D		    return 0;
=2D	    } else if (out) {
=2D		if (!(*out =3D p->tab[c]))
=2D		    return 0;
=2D	    } else if (!p->tab[c])
=2D		return 0;
=2D
=2D	    if (in && *in)
=2D		in++;
=2D	    if (out)
=2D		out++;
=2D	} else if (!p->tab[c])
+	if (!c || !wc || c !=3D wc)
 	    return 0;
=20
 	s++;
+	ws++;
 	p =3D p->next;
+	wp =3D wp->next;
+    }
+
+    while (p && *s) {
+	if (!p->tab[*((unsigned char *) s)])
+	    return 0;
+	p =3D p->next;
+	s++;
     }
+
+    while (wp && *ws) {
+	if (!wp->tab[*((unsigned char *) ws)])
+	    return 0;
+	wp =3D wp->next;
+	ws++;
+    }
+
     return 1;
 }
=20
@@ -1214,38 +1218,48 @@ bld_parts(char *str, int len, int plen,=20
  * buffer line. Initially line is the same as lp, but during recursive
  * calls lp is incremented for storing successive characters. Whenever
  * a full possible string is build, we test if this line matches the
=2D * string given by wlen and word. The in argument contains the characters
=2D * to use for the correspondence classes, it was filled by a call to=20
=2D * pattern_match() in the calling function.
+ * string given by wlen and word.
+ *
+ * wpat contains pattern that matched previously
+ * lpat contains the pattern for line we build
+ * mword is a string that matched wpat before
+ * word is string that we try to match now
+ *
  * The return value is the length of the string matched in the word, it
=2D * is zero if we couldn't build a line that matches the word. */
+ * is zero if we couldn't build a line that matches the word.
+ */
+
=20
 /**/
 static int
=2Dbld_line(Cpattern pat, char *line, char *lp,
=2D	 char *word, int wlen, unsigned char *in, int sfx)
+bld_line(Cpattern wpat, Cpattern lpat, char *line, char *lp,
+	 char *mword, char *word, int wlen, int sfx)
 {
=2D    if (pat) {
+    if (lpat) {
 	/* Still working on the pattern. */
=20
 	int i, l;
 	unsigned char c =3D 0;
=20
 	/* Get the number of the character for a correspondence class
=2D	 * if it has a correxponding class. */
=2D	if (pat->equiv)
=2D	    if ((c =3D *in))
=2D		in++;
+	 * if it has a corresponding class. */
+	if (lpat->equiv)
+	    if (wpat && *mword) {
+		c =3D wpat->tab[STOUC(*mword)];
+		wpat =3D wpat->next;
+		mword++;
+	    }
+
=20
 	/* Walk through the table in the pattern and try the characters
 	 * that may appear in the current position. */
 	for (i =3D 0; i < 256; i++)
=2D	    if ((pat->equiv && c) ? (c =3D=3D pat->tab[i]) : pat->tab[i]) {
+	    if ((lpat->equiv && c) ? (c =3D=3D lpat->tab[i]) : lpat->tab[i]) {
 		*lp =3D i;
 		/* We stored the character, now call ourselves to build
 		 * the rest. */
=2D		if ((l =3D bld_line(pat->next, line, lp + 1, word, wlen,
=2D				  in, sfx)))
+		if ((l =3D bld_line(wpat, lpat->next, line, lp + 1,
+				  mword, word, wlen, sfx)))
 		    return l;
 	    }
     } else {
@@ -1255,7 +1269,6 @@ bld_line(Cpattern pat, char *line, char=20
 	Cmlist ms;
 	Cmatcher mp;
 	int l =3D lp - line, t, rl =3D 0, ind, add;
=2D	VARARR(unsigned char, ea, l + 1);
=20
 	/* Quick test if the strings are exactly the same. */
 	if (l =3D=3D wlen && !strncmp(line, word, l))
@@ -1279,9 +1292,7 @@ bld_line(Cpattern pat, char *line, char=20
 		    mp =3D ms->matcher;
 		    if (mp && !mp->flags && mp->wlen <=3D wlen && mp->llen <=3D l &&
 			pattern_match(mp->line, (sfx ? line - mp->llen : line),
=2D				      NULL, ea) &&
=2D			pattern_match(mp->word, (sfx ? word - mp->wlen : word),
=2D				      ea, NULL)) {
+				      mp->word, (sfx ? word - mp->wlen : word))) {
 			/* Both the line and the word pattern matched,
 			 * now skip over the matched portions. */
 			if (sfx) {
@@ -1316,7 +1327,6 @@ join_strs(int la, char *sa, int lb, char
     static char *rs =3D NULL;
     static int rl =3D 0;
=20
=2D    VARARR(unsigned char, ea, (la > lb ? la : lb) + 1);
     Cmlist ms;
     Cmatcher mp;
     int t, bl, rr =3D rl;
@@ -1331,8 +1341,7 @@ join_strs(int la, char *sa, int lb, char
 		    mp->wlen <=3D la && mp->wlen <=3D lb) {
 		    /* The pattern has no anchors and the word
 		     * pattern fits, try it. */
=2D		    if ((t =3D pattern_match(mp->word, sa, NULL, ea)) ||
=2D			pattern_match(mp->word, sb, NULL, ea)) {
+		    if ((t =3D pattern_match(mp->word, sa, mp->word, sb))) {
 			/* It matched one of the strings, t says which one. */
 			VARARR(char, line, mp->llen + 1);
 			char **ap, **bp;
@@ -1347,8 +1356,8 @@ join_strs(int la, char *sa, int lb, char
 			}
 			/* Now try to build a string that matches the other
 			 * string. */
=2D			if ((bl =3D bld_line(mp->line, line, line,
=2D					   *bp, *blp, ea, 0))) {
+			if ((bl =3D bld_line(mp->word, mp->line, line, line,
+					   *ap, *bp, *blp, 0))) {
 			    /* Found one, put it into the return string. */
 			    line[mp->llen] =3D '\0';
 			    if (rr <=3D mp->llen) {
@@ -1503,7 +1512,6 @@ join_sub(Cmdata md, char *str, int len,=20
 	int ol =3D len, nl =3D md->len;
 	Cmlist ms;
 	Cmatcher mp;
=2D	VARARR(unsigned char, ea, (ol > nl ? ol : nl) + 1);
 	int t;
=20
 	if (sfx) {
@@ -1519,9 +1527,7 @@ join_sub(Cmdata md, char *str, int len,=20
 		 * new one. */
 		if (mp->llen <=3D ol && mp->wlen <=3D nl &&
 		    pattern_match(mp->line, ow - (sfx ? mp->llen : 0),
=2D				  NULL, ea) &&
=2D		    pattern_match(mp->word, nw - (sfx ? mp->wlen : 0),
=2D				  ea, NULL)) {
+				  mp->word, nw - (sfx ? mp->wlen : 0))) {
 		    /* It did, update the contents of the cmdata struct
 		     * and return a cline for the matched part. */
 		    if (sfx)
@@ -1539,17 +1545,22 @@ join_sub(Cmdata md, char *str, int len,=20
 		 * pattern matches one of the strings. */
 		if (join && mp->wlen <=3D ol && mp->wlen <=3D nl &&
 		    ((t =3D pattern_match(mp->word, ow - (sfx ? mp->wlen : 0),
=2D				       NULL, ea)) ||
+				       NULL, NULL)) ||
 		     pattern_match(mp->word, nw - (sfx ? mp->wlen : 0),
=2D				   NULL, ea))) {
+				   NULL, NULL))) {
 		    VARARR(char, line, mp->llen + 1);
 		    int bl;
+		    char *mw;
=20
 		    /* Then build all the possible lines and see
 		     * if one of them matches the other string. */
=2D		    if ((bl =3D bld_line(mp->line, line, line,
=2D				       (t ? nw : ow), (t ? nl : ol),
=2D				       ea, sfx))) {
+		    if (t)
+			mw =3D ow - (sfx ? mp->wlen : 0);
+		    else
+			mw =3D nw - (sfx ? mp->wlen : 0);
+
+		    if ((bl =3D bld_line(mp->word, mp->line, line, line,
+				       mw, (t ? nw : ow), (t ? nl : ol), sfx)))  {
 			/* Yep, one of the lines matched the other
 			 * string. */
 			line[mp->llen] =3D '\0';

--Boundary-01=_6uUHFZSlP+ieXZw--

--nextPart68212220.gGIBeDomvJ
Content-Type: application/pgp-signature

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.5 (GNU/Linux)

iD8DBQBFHUu9R6LMutpd94wRAsj2AJ9e4hUcsFjPoViUYHTyHa/Giu32aQCgs+po
wicAzVnGIsX8v0NNXmTjdyk=
=9OU2
-----END PGP SIGNATURE-----

--nextPart68212220.gGIBeDomvJ--