zsh-workers
 help / color / mirror / code / Atom feed
* Stuff to do
@ 2006-09-27 12:11 Peter Stephenson
  2006-09-27 13:09 ` DervishD
                   ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Peter Stephenson @ 2006-09-27 12:11 UTC (permalink / raw)
  To: Zsh hackers list

I'm aware of the following things that need fixing, not including
problems we aren't currently able to debug directly (crashes and hangs
on certain systems).  They are all problems that show up particularly
when dealing with non-ASCII characters, but the last two aren't actually
specific to that:

- Finish making padding multibyte-aware: it's not done for printf nor
for padding using typeset flags.
- The matcher specifications in completion don't handle multibyte
characters and are currently written in such a way as to make this
hard (similar to the old suffix character handling).
- $' quoting is badly handled in completion.
- More untokenization is needed; certainly in parameter substitution and
quite possibly in random other places including completion.

No doubt more will show up.  This isn't going to be quick to get right.

-- 
Peter Stephenson <pws@csr.com>                  Software Engineer
CSR PLC, Churchill House, Cambridge Business Park, Cowley Road
Cambridge, CB4 0WZ, UK                          Tel: +44 (0)1223 692070


To access the latest news from CSR copy this link into a web browser:  http://www.csr.com/email_sig.php


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Stuff to do
  2006-09-27 12:11 Stuff to do Peter Stephenson
@ 2006-09-27 13:09 ` DervishD
  2006-09-29  3:20   ` Bart Schaefer
  2006-09-29 16:37 ` Andrey Borzenkov
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 12+ messages in thread
From: DervishD @ 2006-09-27 13:09 UTC (permalink / raw)
  To: Peter Stephenson; +Cc: Zsh hackers list

    Hi Peter :)

 * Peter Stephenson <pws@csr.com> dixit:
> No doubt more will show up.  This isn't going to be quick to get right.

    Peter, you're doing a superb job maintaining, fixing and
improving the mess that the zsh code is. The same goes to the other
contributors: you're geniuses (evil geniuses, of course XDD).

    I've read some of the code in zsh and sincerely, I would start
it from scratch. Maintaining such messy code is one of the worst
things I have in my worst nightmares.

    So, if this is not going to be quick, it doesn't matter at all.
Even if you stop maintaining zsh, it doesn't matter at all. We should
be more than thankful to you and the other contributors for what
you've done until now. It's the kind of work that it impossible to
pay, and the reason I make free software.

    Thanks a lot for your hard work: zsh is just fantastic and
although far from perfect, the people behind it are just fantastic.
Go on :))

    Raúl Núñez de Arenas Coronado

-- 
Linux Registered User 88736 | http://www.dervishd.net
It's my PC and I'll cry if I want to... RAmen!


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Stuff to do
  2006-09-27 13:09 ` DervishD
@ 2006-09-29  3:20   ` Bart Schaefer
  2006-09-29  8:05     ` DervishD
  0 siblings, 1 reply; 12+ messages in thread
From: Bart Schaefer @ 2006-09-29  3:20 UTC (permalink / raw)
  To: Zsh hackers list

On Sep 27,  3:09pm, DervishD wrote:
}
}     I've read some of the code in zsh and sincerely, I would start
} it from scratch.

Having worked in the software industry for going on 20 years now, I
can say with quite some confidence that once a piece of code reaches a
certain level of complexity, this is almost always the wrong approach.
Unless of course your goal is to abandon everything about the original
program and re-create only a subset -- because the chances of ever
getting back to where you started are very small.

Zsh passed that level of complexity a very long time ago.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Stuff to do
  2006-09-29  3:20   ` Bart Schaefer
@ 2006-09-29  8:05     ` DervishD
  0 siblings, 0 replies; 12+ messages in thread
From: DervishD @ 2006-09-29  8:05 UTC (permalink / raw)
  To: Bart Schaefer; +Cc: Zsh hackers list

    Hi Bart :)

 * Bart Schaefer <schaefer@brasslantern.com> dixit:
> On Sep 27,  3:09pm, DervishD wrote:
> }     I've read some of the code in zsh and sincerely, I would
> } start it from scratch.
> 
> Having worked in the software industry for going on 20 years now, I
> can say with quite some confidence that once a piece of code
> reaches a certain level of complexity, this is almost always the
> wrong approach.

    I think that this depends on how messy is the code in the
project. I mean, if the code is so messy (and that doesn't mean
complex) that every time you touch anything you wreck havoc, then
doing something from start is a good idea (IMHO).

    I suppose that it's a question of evaluating maintainability
versus the effort of starting from scratch. If a project is barely
maintainable, sometimes it's better to start from scratch, but again
that depends on how complex the design is, of course.

> Zsh passed that level of complexity a very long time ago.

    I know, and of course I'm not asking you to start the code from
scratch, I'm just saying that *I* would do it ;)) I have a project
that too passed that level of complexity a time ago, and in fact I
left the project in the hands of the other co-author because I wasn't
motivated enough to continue, and probably rewriting it will be a
complete stupidity, but if I had the time, I swear I'll rewrite it
from scratch. I simply hate maintaining code that tries to resist
every try of being maintained and which contains pieces that nobody
understands XD

    With Zsh, I think that I would use a subset of it, as you
suggest, if I would start a "new-Zsh". Rewriting all from the scratch
is probably impossible now, because if you want to create a
bug-compatible version of "new-Zsh" you will probably have to use the
same code, so...

    Raúl Núñez de Arenas Coronado

-- 
Linux Registered User 88736 | http://www.dervishd.net
It's my PC and I'll cry if I want to... RAmen!


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Stuff to do
  2006-09-27 12:11 Stuff to do Peter Stephenson
  2006-09-27 13:09 ` DervishD
@ 2006-09-29 16:37 ` Andrey Borzenkov
  2006-09-29 17:08   ` Peter Stephenson
  2006-09-29 18:08 ` Andrey Borzenkov
  2006-10-08 15:38 ` quest for bld_line (was: Re: Stuff to do) Andrey Borzenkov
  3 siblings, 1 reply; 12+ messages in thread
From: Andrey Borzenkov @ 2006-09-29 16:37 UTC (permalink / raw)
  To: zsh-workers


[-- Attachment #1.1: Type: text/plain, Size: 3008 bytes --]

On Wednesday 27 September 2006 16:11, Peter Stephenson wrote:
> - The matcher specifications in completion don't handle multibyte
> characters and are currently written in such a way as to make this
> hard (similar to the old suffix character handling).

I have been looking at it recently. So far the following issues came up.

1. matcher code assumes character == byte and is using 256 bytes array to 
build character equivalence classes. What is worse, it is passing this array 
around between different functions to suppply results of previous matching. I 
have here patch (attached) that eliminates external dependency on this array 
so matcher internals can be more easily changed. This seems to make code a 
bit more understandable irrespectively :) OK to commit?

2. Usage of magic array for character classes ([abcd]) can be naturally 
superceded by using either generic pattern matching or direct comparison. 
Pattern matching provides for using something like [[:lower:]] and possibly 
using matchers etc but potential side effects of extended globbing need 
review. I do not know what is faster. Is it OK?

3. Equivalence classes ({abcd}={xyzw}) do not scale beyond single byte 
characters. But if we check usage I believe, it has never been used for 
anything beyond case-insensitive matching. For this particular usage I 
suggest using new matcher type:

m:LPAT>upper
m:LPAT>lower

with obvious semantic - character from line is converted to lower or upper and 
compared with character from potential match. So m:{a-z}={A-Z} becomes 
m:?>upper etc.

We still can implement {...} for character _set_ but not for character range. 
So far I do not consider it major problem.

4. The hardest part. Right anchor. For this matcher must match _backward_. I 
am not aware of any way to walk backward as long as we assume arbitrary 
encoding. Options apparently are

a) careful modification of code to compute line and patter length and try to 
match from the (llen - plen) point. After all we know that we do not need to 
match more than that. This may be doable, I did not yet try.

b) convert this code to use wide characters. Not sure if this is a viable 
option.

c) Use UTF-8 :) It is no joke - I expect nowadays 99% of all systems using 
multibyte encoding use UTF-8. So we may concentrate on most commonly used 
case and think about other encoding later if anyone really needs it. As soon 
as we assume input be in UTF-8 we immediately get enormous advantages

- it can be easily traversed both backwards and forwards (meta adds some 
complexity but not much)
- length is known in advance, we can save mbrtowc() calls
- if we know that system is using UCS-4 for wide characters (and I guess we do 
not support others at all) converting to/from wide character can be 
implemented without calling mbrtowc() & Co. at all.
- (remotely) this allows us to get rid of META which saves quite a bit of 
memory allocation overhead.

Comments? 

[-- Attachment #1.2: pattern_match.diff --]
[-- Type: text/x-diff, Size: 8852 bytes --]

Index: Src/Zle/compmatch.c
===================================================================
RCS file: /cvsroot/zsh/zsh/Src/Zle/compmatch.c,v
retrieving revision 1.48
diff -u -p -r1.48 compmatch.c
--- Src/Zle/compmatch.c	12 Jan 2006 00:51:51 -0000	1.48
+++ Src/Zle/compmatch.c	29 Sep 2006 16:00:31 -0000
@@ -461,7 +461,6 @@ match_str(char *l, char *w, Brinfo *bpp,
 {
     int ll = strlen(l), lw = strlen(w), oll = ll, olw = lw, exact = 0, wexact = 0;
     int il = 0, iw = 0, t, ind, add, he = 0, bpc, obc = bc, bslash;
-    VARARR(unsigned char, ea, (ll > lw ? ll : lw) + 1);
     char *ow;
     Cmlist ms;
     Cmatcher mp, lm = NULL;
@@ -795,9 +794,7 @@ match_str(char *l, char *w, Brinfo *bpp,
 					  (il || iw)));
 		    }
 		    /* Now try to match the line and word patterns. */
-		    if (!t ||
-			!pattern_match(mp->line, tl, NULL, ea) ||
-			!pattern_match(mp->word, tw, ea, NULL))
+		    if (!t || !pattern_match(mp->line, tl, mp->word, tw))
 			continue;
 
 		    /* Probably add the matched strings. */
@@ -1091,44 +1088,51 @@ comp_match(char *pfx, char *sfx, char *w
 }
 
 /* Check if the given pattern matches the given string.             *
- * `in' and `out' are used for {...} classes. In `out' we store the *
- * character number that was matched. In the word pattern this is   *
- * given in `in' so that we can easily test if we found the         *
- * corresponding character. */
+ *  p and  s are either anchor or line pattern and string;
+ * wp and ws are word (candidate) pattern and string
+ *
+ * If only one pattern is given, we just check if characters match
+ * If both line and word are given, we check that characters match
+ * for {...} classes by comparing relative numbers in sequence.
+ *
+ * Patterns and strings are always passed in pairs, so it is enough
+ * to check for non-NULL wp. p should always be present.
+ */
 
 /**/
 mod_export int
-pattern_match(Cpattern p, char *s, unsigned char *in, unsigned char *out)
+pattern_match(Cpattern p, char *s, Cpattern wp, char *ws)
 {
     unsigned char c;
+    unsigned char wc;
 
-    while (p) {
-	c = *((unsigned char *) s);
+    while (p && wp && *s && *ws) {
+	c = p->tab[*((unsigned char *) s)];
+	wc = wp->tab[*((unsigned char *) ws)];
 
-	if (out)
-	    *out = 0;
-
-	if (p->equiv) {
-	    if (in) {
-		c = p->tab[c];
-		if ((*in && *in != c) || (!*in && !c))
-		    return 0;
-	    } else if (out) {
-		if (!(*out = p->tab[c]))
-		    return 0;
-	    } else if (!p->tab[c])
-		return 0;
-
-	    if (in && *in)
-		in++;
-	    if (out)
-		out++;
-	} else if (!p->tab[c])
+	if (!c || !wc || c != wc)
 	    return 0;
 
 	s++;
+	ws++;
 	p = p->next;
+	wp = wp->next;
+    }
+
+    while (p && *s) {
+	if (!p->tab[*((unsigned char *) s)])
+	    return 0;
+	p = p->next;
+	s++;
     }
+
+    while (wp && *ws) {
+	if (!wp->tab[*((unsigned char *) ws)])
+	    return 0;
+	wp = wp->next;
+	ws++;
+    }
+
     return 1;
 }
 
@@ -1214,38 +1218,48 @@ bld_parts(char *str, int len, int plen, 
  * buffer line. Initially line is the same as lp, but during recursive
  * calls lp is incremented for storing successive characters. Whenever
  * a full possible string is build, we test if this line matches the
- * string given by wlen and word. The in argument contains the characters
- * to use for the correspondence classes, it was filled by a call to 
- * pattern_match() in the calling function.
+ * string given by wlen and word.
+ *
+ * wpat contains pattern that matched previously
+ * lpat contains the pattern for line we build
+ * mword is a string that matched wpat before
+ * word is string that we try to match now
+ *
  * The return value is the length of the string matched in the word, it
- * is zero if we couldn't build a line that matches the word. */
+ * is zero if we couldn't build a line that matches the word.
+ */
+
 
 /**/
 static int
-bld_line(Cpattern pat, char *line, char *lp,
-	 char *word, int wlen, unsigned char *in, int sfx)
+bld_line(Cpattern wpat, Cpattern lpat, char *line, char *lp,
+	 char *mword, char *word, int wlen, int sfx)
 {
-    if (pat) {
+    if (lpat) {
 	/* Still working on the pattern. */
 
 	int i, l;
 	unsigned char c = 0;
 
 	/* Get the number of the character for a correspondence class
-	 * if it has a correxponding class. */
-	if (pat->equiv)
-	    if ((c = *in))
-		in++;
+	 * if it has a corresponding class. */
+	if (lpat->equiv)
+	    if (wpat && *mword) {
+		c = wpat->tab[STOUC(*mword)];
+		wpat = wpat->next;
+		mword++;
+	    }
+
 
 	/* Walk through the table in the pattern and try the characters
 	 * that may appear in the current position. */
 	for (i = 0; i < 256; i++)
-	    if ((pat->equiv && c) ? (c == pat->tab[i]) : pat->tab[i]) {
+	    if ((lpat->equiv && c) ? (c == lpat->tab[i]) : lpat->tab[i]) {
 		*lp = i;
 		/* We stored the character, now call ourselves to build
 		 * the rest. */
-		if ((l = bld_line(pat->next, line, lp + 1, word, wlen,
-				  in, sfx)))
+		if ((l = bld_line(wpat, lpat->next, line, lp + 1,
+				  mword, word, wlen, sfx)))
 		    return l;
 	    }
     } else {
@@ -1255,7 +1269,6 @@ bld_line(Cpattern pat, char *line, char 
 	Cmlist ms;
 	Cmatcher mp;
 	int l = lp - line, t, rl = 0, ind, add;
-	VARARR(unsigned char, ea, l + 1);
 
 	/* Quick test if the strings are exactly the same. */
 	if (l == wlen && !strncmp(line, word, l))
@@ -1279,9 +1292,7 @@ bld_line(Cpattern pat, char *line, char 
 		    mp = ms->matcher;
 		    if (mp && !mp->flags && mp->wlen <= wlen && mp->llen <= l &&
 			pattern_match(mp->line, (sfx ? line - mp->llen : line),
-				      NULL, ea) &&
-			pattern_match(mp->word, (sfx ? word - mp->wlen : word),
-				      ea, NULL)) {
+				      mp->word, (sfx ? word - mp->wlen : word))) {
 			/* Both the line and the word pattern matched,
 			 * now skip over the matched portions. */
 			if (sfx) {
@@ -1316,7 +1327,6 @@ join_strs(int la, char *sa, int lb, char
     static char *rs = NULL;
     static int rl = 0;
 
-    VARARR(unsigned char, ea, (la > lb ? la : lb) + 1);
     Cmlist ms;
     Cmatcher mp;
     int t, bl, rr = rl;
@@ -1331,8 +1341,7 @@ join_strs(int la, char *sa, int lb, char
 		    mp->wlen <= la && mp->wlen <= lb) {
 		    /* The pattern has no anchors and the word
 		     * pattern fits, try it. */
-		    if ((t = pattern_match(mp->word, sa, NULL, ea)) ||
-			pattern_match(mp->word, sb, NULL, ea)) {
+		    if ((t = pattern_match(mp->word, sa, mp->word, sb))) {
 			/* It matched one of the strings, t says which one. */
 			VARARR(char, line, mp->llen + 1);
 			char **ap, **bp;
@@ -1347,8 +1356,8 @@ join_strs(int la, char *sa, int lb, char
 			}
 			/* Now try to build a string that matches the other
 			 * string. */
-			if ((bl = bld_line(mp->line, line, line,
-					   *bp, *blp, ea, 0))) {
+			if ((bl = bld_line(mp->word, mp->line, line, line,
+					   *ap, *bp, *blp, 0))) {
 			    /* Found one, put it into the return string. */
 			    line[mp->llen] = '\0';
 			    if (rr <= mp->llen) {
@@ -1503,7 +1512,6 @@ join_sub(Cmdata md, char *str, int len, 
 	int ol = len, nl = md->len;
 	Cmlist ms;
 	Cmatcher mp;
-	VARARR(unsigned char, ea, (ol > nl ? ol : nl) + 1);
 	int t;
 
 	if (sfx) {
@@ -1519,9 +1527,7 @@ join_sub(Cmdata md, char *str, int len, 
 		 * new one. */
 		if (mp->llen <= ol && mp->wlen <= nl &&
 		    pattern_match(mp->line, ow - (sfx ? mp->llen : 0),
-				  NULL, ea) &&
-		    pattern_match(mp->word, nw - (sfx ? mp->wlen : 0),
-				  ea, NULL)) {
+				  mp->word, nw - (sfx ? mp->wlen : 0))) {
 		    /* It did, update the contents of the cmdata struct
 		     * and return a cline for the matched part. */
 		    if (sfx)
@@ -1539,17 +1545,22 @@ join_sub(Cmdata md, char *str, int len, 
 		 * pattern matches one of the strings. */
 		if (join && mp->wlen <= ol && mp->wlen <= nl &&
 		    ((t = pattern_match(mp->word, ow - (sfx ? mp->wlen : 0),
-				       NULL, ea)) ||
+				       NULL, NULL)) ||
 		     pattern_match(mp->word, nw - (sfx ? mp->wlen : 0),
-				   NULL, ea))) {
+				   NULL, NULL))) {
 		    VARARR(char, line, mp->llen + 1);
 		    int bl;
+		    char *mw;
 
 		    /* Then build all the possible lines and see
 		     * if one of them matches the other string. */
-		    if ((bl = bld_line(mp->line, line, line,
-				       (t ? nw : ow), (t ? nl : ol),
-				       ea, sfx))) {
+		    if (t)
+			mw = ow - (sfx ? mp->wlen : 0);
+		    else
+			mw = nw - (sfx ? mp->wlen : 0);
+
+		    if ((bl = bld_line(mp->word, mp->line, line, line,
+				       mw, (t ? nw : ow), (t ? nl : ol), sfx)))  {
 			/* Yep, one of the lines matched the other
 			 * string. */
 			line[mp->llen] = '\0';

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Stuff to do
  2006-09-29 16:37 ` Andrey Borzenkov
@ 2006-09-29 17:08   ` Peter Stephenson
  2006-09-29 18:08     ` Andrey Borzenkov
  0 siblings, 1 reply; 12+ messages in thread
From: Peter Stephenson @ 2006-09-29 17:08 UTC (permalink / raw)
  To: zsh-workers

Andrey Borzenkov <arvidjaar@newmail.ru> wrote:
> 1. matcher code assumes character == byte and is using 256 bytes array to
> build character equivalence classes. What is worse, it is passing this array
> around between different functions to suppply results of previous matching. I
> have here patch (attached) that eliminates external dependency on this array
> so matcher internals can be more easily changed. This seems to make code a
> bit more understandable irrespectively :) OK to commit?

Yes, the more the calling conventions are sanitized like this the better
I like it.  The references to external data are one of my worst
nightmares.

> 2. Usage of magic array for character classes ([abcd]) can be naturally
> superceded by using either generic pattern matching or direct comparison.
> Pattern matching provides for using something like [[:lower:]] and possibly
> using matchers etc but potential side effects of extended globbing need
> review. I do not know what is faster. Is it OK?

I'd be quite keen on being able to do this by using globbing.  I think the
current uses of matcher specifications are limited enough (sometimes by
necessity, as we're seeing) that an extension wouldn't be a problem for
compatibility; however, I don't know how to mix this with the equivalence
class stuff.  It would be quite nice to keep it in one place in pattern.c,
but I doubt if that's going to work with all the additions we need.

> 3. Equivalence classes ({abcd}={xyzw}) do not scale beyond single byte
> characters. But if we check usage I believe, it has never been used for
> anything beyond case-insensitive matching. For this particular usage I
> suggest using new matcher type:
>
> m:LPAT>upper
> m:LPAT>lower
>
> with obvious semantic - character from line is converted to lower or upper and
> compared with character from potential match. So m:{a-z}={A-Z} becomes
> m:?>upper etc.
>
> We still can implement {...} for character _set_ but not for character range.
> So far I do not consider it major problem.

I think we'll need to keep it working for ASCII for compatibility, but not
extending it to other characters is, as you say, not a big problem.
However, maybe it's not a problem at all; see below.

> 4. The hardest part. Right anchor. For this matcher must match _backward_. I
> am not aware of any way to walk backward as long as we assume arbitrary
> encoding. Options apparently are
>...
> b) convert this code to use wide characters. Not sure if this is a viable
> option.

This is the option I was thinking about, and it removes the range problem
since it extends the ASCII logic in a natural way (it may be system
dependent, but that's the absolute least of our worries).

I don't think it's a problem using wide characters locally for the
comparisons.  Indeed, the pattern match code does all its character class
stuff with wide characters (or kludged wide characters which are just the
unsigned char values if a multibyte sequence doesn't convert).  It doesn't
really make sense to allow for unconvertible characters in matcher
comparisons---it's great to be able to insert them on the command line in
some fashion, but the matcher specs only make sense for characters that are
convertible.

The worst problem is that we lose the ability to do matching control where
(say) much of the string is ASCII, and our match rules only use ASCII, but
there are also characters that don't work in the current locale.  I don't
think this is a big issue and there are possible ways round:
- partial conversion
- convert them at this stage to $'\...' sequences instead of later
- use marked wide characters where we record a byte that can't be converted
--- any of which could be bolted on later.  So I don't think that's a
showstopper.

I was wondering how much of the code we needed to convert to use wide
characters, and vaguely came to the conclusion the only reasonable sane way
was to do it fairly locally within the comparison function(s), since
otherwise the interface to the rest of the completion system gets very
hairy.  However, I haven't actually looked at the code again since
coming to that conclusion.

However, if there's an easy way of doing it by another method, fine.  I
suspect there isn't.

-- 
Peter Stephenson <pws@csr.com>                  Software Engineer
CSR PLC, Churchill House, Cambridge Business Park, Cowley Road
Cambridge, CB4 0WZ, UK                          Tel: +44 (0)1223 692070


To access the latest news from CSR copy this link into a web browser:  http://www.csr.com/email_sig.php


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Stuff to do
  2006-09-29 17:08   ` Peter Stephenson
@ 2006-09-29 18:08     ` Andrey Borzenkov
  0 siblings, 0 replies; 12+ messages in thread
From: Andrey Borzenkov @ 2006-09-29 18:08 UTC (permalink / raw)
  To: zsh-workers

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Friday 29 September 2006 21:08, Peter Stephenson wrote:
>
> Yes, the more the calling conventions are sanitized like this the better
> I like it.  The references to external data are one of my worst
> nightmares.
>

Ok I committed it.

> > 2. Usage of magic array for character classes ([abcd]) can be naturally
> > superceded by using either generic pattern matching or direct comparison.
> > Pattern matching provides for using something like [[:lower:]] and
> > possibly using matchers etc but potential side effects of extended
> > globbing need review. I do not know what is faster. Is it OK?
>
> I'd be quite keen on being able to do this by using globbing.  I think the
> current uses of matcher specifications are limited enough (sometimes by
> necessity, as we're seeing) that an extension wouldn't be a problem for
> compatibility; however, I don't know how to mix this with the equivalence
> class stuff.  It would be quite nice to keep it in one place in pattern.c,
> but I doubt if that's going to work with all the additions we need.
>
> > 3. Equivalence classes ({abcd}={xyzw}) do not scale beyond single byte
> > characters. But if we check usage I believe, it has never been used for
> > anything beyond case-insensitive matching. For this particular usage I
> > suggest using new matcher type:
> >
> > m:LPAT>upper
> > m:LPAT>lower
> >
> > with obvious semantic - character from line is converted to lower or
> > upper and compared with character from potential match. So m:{a-z}={A-Z}
> > becomes m:?>upper etc.
> >
> > We still can implement {...} for character _set_ but not for character
> > range. So far I do not consider it major problem.
>
> I think we'll need to keep it working for ASCII for compatibility,

yes, that for sure; the idea was actually check if character is < 256 and 
reject pattern otherwise.

> but not 
> extending it to other characters is, as you say, not a big problem.
> However, maybe it's not a problem at all; see below.
>
> > 4. The hardest part. Right anchor. For this matcher must match
> > _backward_. I am not aware of any way to walk backward as long as we
> > assume arbitrary encoding. Options apparently are
> >...
> > b) convert this code to use wide characters. Not sure if this is a viable
> > option.
>
> This is the option I was thinking about, and it removes the range problem
> since it extends the ASCII logic in a natural way (it may be system
> dependent, but that's the absolute least of our worries).
>

It does not, unfortunately. For ranges that is. What you effectively suggest 
is to assume that for {a-z}={A-Z} p matches q if p-a == q-A. This is not 
actually true even for 8 bit EBCDIC (I won't eat my hat on it though :); for 
arbitrary encoding it is simply meaningless. This fails even for basic 
European plane (because ß has no upper counterpart); and as soon as you move 
to next plane (1xx) it stops working completely.

We really need something more portable; the exact syntax is open to discussion 
of course but so far I like m:?>upper :) It is so zsh-ish obscure ...

> I don't think it's a problem using wide characters locally for the
> comparisons.  Indeed, the pattern match code does all its character class
> stuff with wide characters (or kludged wide characters which are just the
> unsigned char values if a multibyte sequence doesn't convert).  It doesn't
> really make sense to allow for unconvertible characters in matcher
> comparisons---it's great to be able to insert them on the command line in
> some fashion, but the matcher specs only make sense for characters that are
> convertible.
>
> The worst problem is that we lose the ability to do matching control where
> (say) much of the string is ASCII, and our match rules only use ASCII, but
> there are also characters that don't work in the current locale.  I don't
> think this is a big issue and there are possible ways round:
> - partial conversion
> - convert them at this stage to $'\...' sequences instead of later

as long as we are able to distinguish between line with real $'...' it is 
fine. Are we?

> - use marked wide characters where we record a byte that can't be converted

That is probably better. This has to be implemented at some point anyway to 
allow input of arbitrary data (vared)

> --- any of which could be bolted on later.  So I don't think that's a
> showstopper.
>
> I was wondering how much of the code we needed to convert to use wide
> characters, and vaguely came to the conclusion the only reasonable sane way
> was to do it fairly locally within the comparison function(s), 

Sure at some point pattern_match will have to convert to wide character to 
test for properties. It does not solve the l[-1] problem; and code is full of 
them.

> since 
> otherwise the interface to the rest of the completion system gets very
> hairy.

Exactly. I tried to convert compamtch.c to wide characters. This itself is 
more or less mechanical task; but it adds insane amount of converting all 
over the other places. This is already slow as is :(

> However, I haven't actually looked at the code again since 
> coming to that conclusion.
>
> However, if there's an easy way of doing it by another method, fine.  I
> suspect there isn't.

it depends on your definition of easy :)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.5 (GNU/Linux)

iD8DBQFFHWEGR6LMutpd94wRAqqBAJ0b+XY0Fr24mg0LZpY/CxyMnKZgUwCgzRZE
qirOdaNVPC4pk0wivdvEAvY=
=ocDN
-----END PGP SIGNATURE-----


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Stuff to do
  2006-09-27 12:11 Stuff to do Peter Stephenson
  2006-09-27 13:09 ` DervishD
  2006-09-29 16:37 ` Andrey Borzenkov
@ 2006-09-29 18:08 ` Andrey Borzenkov
  2006-10-08 15:38 ` quest for bld_line (was: Re: Stuff to do) Andrey Borzenkov
  3 siblings, 0 replies; 12+ messages in thread
From: Andrey Borzenkov @ 2006-09-29 18:08 UTC (permalink / raw)
  To: zsh-workers

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Wednesday 27 September 2006 16:11, Peter Stephenson wrote:
> I'm aware of the following things that need fixing, not including
> problems we aren't currently able to debug directly (crashes and hangs
> on certain systems).  They are all problems that show up particularly
> when dealing with non-ASCII characters, but the last two aren't actually
> specific to that:
>
> - Finish making padding multibyte-aware: it's not done for printf nor
> for padding using typeset flags.
> - The matcher specifications in completion don't handle multibyte
> characters and are currently written in such a way as to make this
> hard (similar to the old suffix character handling).
> - $' quoting is badly handled in completion.
> - More untokenization is needed; certainly in parameter substitution and
> quite possibly in random other places including completion.
>
> No doubt more will show up. 

- - vared does not work for binary data
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.5 (GNU/Linux)

iD8DBQFFHWEtR6LMutpd94wRAiUCAJ97uTgt6pc5znNmHDu/IMm93uffaQCcDQ6N
aP1EqG7i4/CRLf7xS/0lIpA=
=wS6Q
-----END PGP SIGNATURE-----


^ permalink raw reply	[flat|nested] 12+ messages in thread

* quest for bld_line (was: Re: Stuff to do)
  2006-09-27 12:11 Stuff to do Peter Stephenson
                   ` (2 preceding siblings ...)
  2006-09-29 18:08 ` Andrey Borzenkov
@ 2006-10-08 15:38 ` Andrey Borzenkov
  2006-10-09 12:00   ` Peter Stephenson
  3 siblings, 1 reply; 12+ messages in thread
From: Andrey Borzenkov @ 2006-10-08 15:38 UTC (permalink / raw)
  To: zsh-workers

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Wednesday 27 September 2006 16:11, Peter Stephenson wrote:
> - The matcher specifications in completion don't handle multibyte
> characters and are currently written in such a way as to make this
> hard (similar to the old suffix character handling).

OK here is next patch that does not fix the above but tries to remove one more 
obstacle for it.

bld_line tries to find (and actually build) a line that can match two given 
words. It does so by building *all* possible lines that match one word and 
trying to match every built line against second word. Now the word "all" 
makes possibility to do the same for arbitrary character set rather abstract.

I must admit that I still do not understand why Sven needed this function nor 
how line that it builds is used later. What I am confident in, the Clines 
that are built using this function are removed later in compresult and never 
appear anywhere on command line.

I tried to invent some way to mimic it as close to original as I could. It is 
incomplete; nor am I sure if there any way to do it differently.

The point of patch is to replace exhaustive enumeration of all possible 
combinations by comparison of patterns. I.e. it checks if two patterns may 
have something in common - this can be generalized later using different 
pattern representations.

I would be happy if we could just toss away this function.

Comments?

Index: Src/Zle/compmatch.c
===================================================================
RCS file: /cvsroot/zsh/zsh/Src/Zle/compmatch.c,v
retrieving revision 1.50
diff -u -p -r1.50 compmatch.c
- --- Src/Zle/compmatch.c	30 Sep 2006 06:53:15 -0000	1.50
+++ Src/Zle/compmatch.c	8 Oct 2006 15:27:11 -0000
@@ -1214,105 +1214,164 @@ bld_parts(char *str, int len, int plen, 
     return ret;
 }
 
- -/* This builds all the possible line patterns for the pattern pat in the
- - * buffer line. Initially line is the same as lp, but during recursive
- - * calls lp is incremented for storing successive characters. Whenever
- - * a full possible string is build, we test if this line matches the
- - * string given by wlen and word.
+/*
+ * Compare two different line patterns if they can have some common character
+ * Insert one of common characters in line we are building (it does not 
matter
+ * which one)
+ * mlp	- line pattern which has matched before
+ * mwp	- word pattern which has matched before
+ * nlp	- new line pattern that we currently test against mlp
+ * nwp	- new word pattern that we currently test against mwp
+ * line	- line we build; we insert characters there
+ */
+
+/**/
+static int
+pattern_compare(Cpattern mlp, Cpattern mwp, Cpattern nlp, Cpattern nwp,
+		char *line)
+{
+    while (nlp) {
+	int i;
+
+	/*
+	 * test to see if mlp and nlp have something in commons
+	 * nlp cannot be less than mlp (we check pattern length before)
+	 * but word pattern may of course be shorter than line ...
+	 */
+	for (i = 0; i < 256; i++)
+	    if (mlp->tab[i] && nlp->tab[i]) {
+		/* for equiv. class they must also match word pattern */
+		if (mlp->equiv) {
+		    if (!mwp || !nwp || (mlp->tab[i] == mwp->tab[i] &&
+			nlp->tab[i] == nwp->tab[i]))
+			break;
+		} else
+		    break;
+	    }
+	if (i < 256) {
+	    /* OK we found character that matches both matchers */
+	    *line++ = (char)i;
+	} else {
+	    /* No matching character */
+	    return 0;
+	}
+	/* FIXME can this be out of bounds? */
+	mlp = mlp->next;
+	nlp = nlp->next;
+	if (mwp) mwp = mwp->next;
+	if (nwp) nwp = nwp->next;
+    }
+
+    return 1;
+}
+
+/* This tries to find out, if there is common line that may match two
+ * words (possible matches or parts thereof). When this function is called,
+ * it is ensured that `mword' has matched word pattern in `matcher';
+ * we try to find a string that both matches line pattern in `matcher'
+ * and another word `word'
  *
- - * wpat contains pattern that matched previously
- - * lpat contains the pattern for line we build
- - * mword is a string that matched wpat before
- - * word is string that we try to match now
+ * matcher - matcher that `mword' has been matched against
+ * line    - buffer for string we build
+ * mword   - word that has matched word pattern in `matcher' before
+ * word    - is string that we try to match now
+ * wlen    - length of `word'
+ * sfx     - if we should match bacwards
  *
- - * The return value is the length of the string matched in the word, it
+ * The return value is the length of the string matched in the `word', it
  * is zero if we couldn't build a line that matches the word.
+ *
+ * FIXME implementation is incomplete. In particular, it won't catch
+ * the case when part of line would have been equal to `word' and part
+ * requires matchers. I cannot find a way to do it without exaustive
+ * building of all possible line's that cannot be done as long as patterns
+ * may contain arbitrary multibyte characters
  */
 
- -
 /**/
 static int
- -bld_line(Cpattern wpat, Cpattern lpat, char *line, char *lp,
+bld_line(Cmatcher matcher,  char *line,
 	 char *mword, char *word, int wlen, int sfx)
 {
- -    if (lpat) {
- -	/* Still working on the pattern. */
- -
- -	int i, l;
- -	unsigned char c = 0;
- -
- -	/* Get the number of the character for a correspondence class
- -	 * if it has a corresponding class. */
- -	if (lpat->equiv)
- -	    if (wpat && *mword) {
- -		c = wpat->tab[STOUC(*mword)];
- -		wpat = wpat->next;
- -		mword++;
- -	    }
+    VARARR(Cpattern, mlpa, matcher->llen);
+    VARARR(Cpattern, mwpa, matcher->wlen);
+    Cmlist ms;
+    Cmatcher mp;
+    Cpattern pat;
+    char *lp;
+    int l = matcher->llen, t, rl = 0, ind, add, il, iw, i;
+
+    /* Quick test if word may be direct input line */
+    if (l == wlen &&
+	pattern_match(matcher->line, word,
+		      matcher->word, mword)) {
+	strncpy(line, word, wlen);
+	line[l] = '\0';
+	return l;
+    }
 
+    /* Setup array instead of list; this is required for suffix match */
+    for (i = 0, pat = matcher->line; pat; i++, pat = pat->next)
+	mlpa[i] = pat;
+    for (i = 0, pat = matcher->word; pat; i++, pat = pat->next)
+	mwpa[i] = pat;
 
- -	/* Walk through the table in the pattern and try the characters
- -	 * that may appear in the current position. */
- -	for (i = 0; i < 256; i++)
- -	    if ((lpat->equiv && c) ? (c == lpat->tab[i]) : lpat->tab[i]) {
- -		*lp = i;
- -		/* We stored the character, now call ourselves to build
- -		 * the rest. */
- -		if ((l = bld_line(wpat, lpat->next, line, lp + 1,
- -				  mword, word, wlen, sfx)))
- -		    return l;
- -	    }
+    if (sfx) {
+	ind = -1; add = -1;
+	il = matcher->llen;
+	iw = matcher->wlen;
+	lp = line + il; word += wlen;
     } else {
- -	/* We reached the end, i.e. the line string is fully build, now
- -	 * see if it matches the given word. */
- -
- -	Cmlist ms;
- -	Cmatcher mp;
- -	int l = lp - line, t, rl = 0, ind, add;
- -
- -	/* Quick test if the strings are exactly the same. */
- -	if (l == wlen && !strncmp(line, word, l))
- -	    return l;
+	ind = 0; add = 1;
+	il = iw = 0;
+	lp = line;
+    }
 
- -	if (sfx) {
- -	    line = lp; word += wlen;
- -	    ind = -1; add = -1;
- -	} else {
- -	    ind = 0; add = 1;
- -	}
- -	/* We loop through the whole line string built. */
- -	while (l && wlen) {
- -	    if (word[ind] == line[ind]) {
- -		/* The same character in both strings, skip over. */
- -		line += add; word += add;
- -		l--; wlen--; rl++;
- -	    } else {
- -		t = 0;
- -		for (ms = bmatchers; ms && !t; ms = ms->next) {
- -		    mp = ms->matcher;
- -		    if (mp && !mp->flags && mp->wlen <= wlen && mp->llen <= l &&
- -			pattern_match(mp->line, (sfx ? line - mp->llen : line),
- -				      mp->word, (sfx ? word - mp->wlen : word))) {
- -			/* Both the line and the word pattern matched,
- -			 * now skip over the matched portions. */
- -			if (sfx) {
- -			    line -= mp->llen; word -= mp->wlen;
- -			} else {
- -			    line += mp->llen; word += mp->wlen;
- -			}
- -			l -= mp->llen; wlen -= mp->wlen; rl += mp->wlen;
- -			t = 1;
+    /* Loop through both words */
+    while (l && wlen) {
+#if 0
+	/* FIXME this code is likely wrong and so is disabled for now */
+	if (word[ind] == mword[ind]) {
+	    /* The same character in both strings, add it to line and
+	     * skip over. */
+	    lp[ind] = word[ind];
+	    lp += add; word += add; mword += add;
+	    l--; wlen--; rl++;
+	} else
+#endif
+	{
+	    t = 0;
+	    for (ms = bmatchers; ms && !t; ms = ms->next) {
+		mp = ms->matcher;
+		if (mp && !mp->flags && mp->wlen <= wlen && mp->llen <= l &&
+		    pattern_match(mp->word, (sfx ? word - mp->wlen : word),
+				  NULL, NULL) &&
+		    pattern_compare(mlpa[sfx ? il - mp->llen : il],
+				    mwpa[sfx ? iw - mp->wlen : iw],
+				    mp->line, mp->word,
+				    (sfx ? lp - mp->llen : lp))) {
+		    /* Both the line and the word pattern matched,
+		     * now skip over the matched portions. */
+		    if (sfx) {
+			lp -= mp->llen; word -= mp->wlen;
+			il -= mp->llen; iw   -= mp->wlen;
+		    } else {
+			lp += mp->llen; word += mp->wlen;
+			il += mp->llen; iw   += mp->wlen;
 		    }
+		    l -= mp->llen; wlen -= mp->wlen; rl += mp->wlen;
+		    t = 1;
 		}
- -		if (!t)
- -		    /* Didn't match, give up. */
- -		    return 0;
 	    }
+	    if (!t)
+		/* Didn't match, give up. */
+		return 0;
 	}
- -	if (!l)
- -	    /* Unmatched portion in the line built, return matched length. */
- -	    return rl;
     }
+    if (!l)
+	/* Unmatched portion in the line built, return matched length. */
+	return rl;
+
     return 0;
 }
 
@@ -1357,7 +1416,7 @@ join_strs(int la, char *sa, int lb, char
 			}
 			/* Now try to build a string that matches the other
 			 * string. */
- -			if ((bl = bld_line(mp->word, mp->line, line, line,
+			if ((bl = bld_line(mp, line,
 					   *ap, *bp, *blp, 0))) {
 			    /* Found one, put it into the return string. */
 			    line[mp->llen] = '\0';
@@ -1560,7 +1619,7 @@ join_sub(Cmdata md, char *str, int len, 
 		    else
 			mw = nw - (sfx ? mp->wlen : 0);
 
- -		    if ((bl = bld_line(mp->word, mp->line, line, line,
+		    if ((bl = bld_line(mp, line,
 				       mw, (t ? nw : ow), (t ? nl : ol), sfx)))  {
 			/* Yep, one of the lines matched the other
 			 * string. */
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.5 (GNU/Linux)

iD8DBQFFKRt9R6LMutpd94wRAqnyAJ0TDZPQf5XZGTiYyHgi7Kn7KRTxRACgqq2S
eVFyUpbIOQljAVVl3VV1GnU=
=jO/g
-----END PGP SIGNATURE-----


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: quest for bld_line (was: Re: Stuff to do)
  2006-10-08 15:38 ` quest for bld_line (was: Re: Stuff to do) Andrey Borzenkov
@ 2006-10-09 12:00   ` Peter Stephenson
  2006-10-09 16:28     ` Andrey Borzenkov
  0 siblings, 1 reply; 12+ messages in thread
From: Peter Stephenson @ 2006-10-09 12:00 UTC (permalink / raw)
  To: Zsh hackers list

Andrey Borzenkov <arvidjaar@newmail.ru> wrote:
> The point of patch is to replace exhaustive enumeration of all possible 
> combinations by comparison of patterns. I.e. it checks if two patterns may 
> have something in common - this can be generalized later using different 
> pattern representations.
> 
> I would be happy if we could just toss away this function.
> 
> Comments?

I don't know enough about this stage of completion to comment in any
detail, but it certainly looks like we need to do something about
it... I'll try this locally.  (There's a word wrapped early on which is
preventing it applying but the fix is obvious.)

-- 
Peter Stephenson <pws@csr.com>                  Software Engineer
CSR PLC, Churchill House, Cambridge Business Park, Cowley Road
Cambridge, CB4 0WZ, UK                          Tel: +44 (0)1223 692070


To access the latest news from CSR copy this link into a web browser:  http://www.csr.com/email_sig.php


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: quest for bld_line (was: Re: Stuff to do)
  2006-10-09 12:00   ` Peter Stephenson
@ 2006-10-09 16:28     ` Andrey Borzenkov
  2006-10-11 17:54       ` Andrey Borzenkov
  0 siblings, 1 reply; 12+ messages in thread
From: Andrey Borzenkov @ 2006-10-09 16:28 UTC (permalink / raw)
  To: zsh-workers

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Monday 09 October 2006 16:00, Peter Stephenson wrote:
> Andrey Borzenkov <arvidjaar@newmail.ru> wrote:
> > The point of patch is to replace exhaustive enumeration of all possible
> > combinations by comparison of patterns. I.e. it checks if two patterns
> > may have something in common - this can be generalized later using
> > different pattern representations.
> >
> > I would be happy if we could just toss away this function.
> >
> > Comments?
>
> I don't know enough about this stage of completion to comment in any
> detail, but it certainly looks like we need to do something about
> it... 

to avoid duplication of efforts - I know now how to do it properly (that is, 
to preserve full functionality without relying on current fixed 256 byte 
tables), I just have to find some time to write it down.

> I'll try this locally.  (There's a word wrapped early on which is 
> preventing it applying but the fix is obvious.)

sorry for this; every time I hope kmail has more common sense to not do it
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.5 (GNU/Linux)

iD8DBQFFKnjCR6LMutpd94wRAj/NAJ9yLWog2MQ1ZY//ri6qEU2+Gao/RwCfbMo5
GtcjO5630G0KqQJQzC/VF1g=
=A6m8
-----END PGP SIGNATURE-----


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: quest for bld_line (was: Re: Stuff to do)
  2006-10-09 16:28     ` Andrey Borzenkov
@ 2006-10-11 17:54       ` Andrey Borzenkov
  0 siblings, 0 replies; 12+ messages in thread
From: Andrey Borzenkov @ 2006-10-11 17:54 UTC (permalink / raw)
  To: zsh-workers


[-- Attachment #1.1: Type: text/plain, Size: 1774 bytes --]

On Monday 09 October 2006 20:28, Andrey Borzenkov wrote:
> On Monday 09 October 2006 16:00, Peter Stephenson wrote:
> > Andrey Borzenkov <arvidjaar@newmail.ru> wrote:
> > > The point of patch is to replace exhaustive enumeration of all possible
> > > combinations by comparison of patterns. I.e. it checks if two patterns
> > > may have something in common - this can be generalized later using
> > > different pattern representations.
> > >
> > > I would be happy if we could just toss away this function.
> > >
> > > Comments?
> >
> > I don't know enough about this stage of completion to comment in any
> > detail, but it certainly looks like we need to do something about
> > it...
>
> to avoid duplication of efforts - I know now how to do it properly (that
> is, to preserve full functionality without relying on current fixed 256
> byte tables), I just have to find some time to write it down.
>

OK so here is the updated patch; all tests passed; unfortunately I still do 
not know how to write regression tests that would specifically check this 
function.

I dare to presume that code became legible enough to be understandable now. It 
checks if it can match next character(s) and then recursively tries to match 
the rest. The net effect is that instead of answering "how to find all 
characters matching given pattern" we have to answer "can two patterns match 
the same character". It is obvious for literal character, ? and equivalence 
class; it is also obvious for future [:upper:] and [:lower:] patterns; and it 
is straightforward (albeit not necessarily fast) for [...] set.

Question - how does zsh internally represent [...] now? If this is a (set of) 
bit strings, comparison is trivial.

Comments (now attached)?

[-- Attachment #1.2: bld_line.diff --]
[-- Type: text/x-diff, Size: 11322 bytes --]

Index: Src/Zle/compmatch.c
===================================================================
RCS file: /cvsroot/zsh/zsh/Src/Zle/compmatch.c,v
retrieving revision 1.50
diff -u -p -r1.50 compmatch.c
--- Src/Zle/compmatch.c	30 Sep 2006 06:53:15 -0000	1.50
+++ Src/Zle/compmatch.c	11 Oct 2006 17:43:58 -0000
@@ -1087,51 +1087,70 @@ comp_match(char *pfx, char *sfx, char *w
     return r;
 }
 
-/* Check if the given pattern matches the given string.             *
- *  p and  s are either anchor or line pattern and string;
+/*
+ * Try to match single pair of characters from line and word
+ * Either of line or word patterns (and characters) may be missing
+ */
+static int
+cm_try_single(Cpattern lp, char lc, Cpattern wp, char wc)
+{
+    char li = 0, wi = 0;
+
+    if (lp) {
+	li = lp->tab[STOUC(lc)];
+	if (!li)
+	    return 0;
+    }
+    if (wp) {
+	wi = wp->tab[STOUC(wc)];
+	if (!wi)
+	    return 0;
+    }
+
+    /* Equiv is expected to be defined on both sides */
+    if (lp && lp->equiv && wp && wp->equiv && li != wi)
+	return 0;
+
+    return 1;
+}
+
+/*
+ * Check if the given pattern matches the given string.             *
+ * lp and ls are either anchor or line pattern and string;
  * wp and ws are word (candidate) pattern and string
- *
- * If only one pattern is given, we just check if characters match
- * If both line and word are given, we check that characters match
- * for {...} classes by comparing relative numbers in sequence.
- *
- * Patterns and strings are always passed in pairs, so it is enough
- * to check for non-NULL wp. p should always be present.
  */
 
 /**/
 mod_export int
-pattern_match(Cpattern p, char *s, Cpattern wp, char *ws)
+pattern_match(Cpattern lp, char *ls, Cpattern wp, char *ws)
 {
-    unsigned char c;
-    unsigned char wc;
-
-    while (p && wp && *s && *ws) {
-	c = p->tab[*((unsigned char *) s)];
-	wc = wp->tab[*((unsigned char *) ws)];
-
-	if (!c || !wc || c != wc)
+    while (lp && wp && *ls && *ws) {
+	if (!cm_try_single(lp, *ls, wp, *ws))
 	    return 0;
 
-	s++;
+	ls++;
 	ws++;
-	p = p->next;
+	lp = lp->next;
 	wp = wp->next;
     }
 
-    while (p && *s) {
-	if (!p->tab[*((unsigned char *) s)])
+    while (lp && *ls) {
+	if (!cm_try_single(lp, *ls, NULL, '\0'))
 	    return 0;
-	p = p->next;
-	s++;
+	lp = lp->next;
+	ls++;
     }
 
     while (wp && *ws) {
-	if (!wp->tab[*((unsigned char *) ws)])
+	if (!cm_try_single(NULL, '\0', wp, *ws))
 	    return 0;
 	wp = wp->next;
 	ws++;
     }
+    
+    /* If any line is shorter than pattern, return falure */
+    if ((lp && !*ls) || (wp && !*ws))
+	return 0;
 
     return 1;
 }
@@ -1214,106 +1233,164 @@ bld_parts(char *str, int len, int plen, 
     return ret;
 }
 
-/* This builds all the possible line patterns for the pattern pat in the
- * buffer line. Initially line is the same as lp, but during recursive
- * calls lp is incremented for storing successive characters. Whenever
- * a full possible string is build, we test if this line matches the
- * string given by wlen and word.
+/*
+ * Compare two different line patterns if they can have some common character
+ * Insert one of common characters in line we are building (it does not matter
+ * which one)
+ * mlp	- line pattern which has matched before
+ * mwp	- word pattern which has matched before
+ * nlp	- new line pattern that we currently test against mlp
+ * nwp	- new word pattern that we currently test against mwp
+ * line	- line we build; we insert characters there
+ */
+
+/**/
+static int
+pattern_compare(Cpattern mlp, Cpattern mwp, Cpattern nlp, Cpattern nwp,
+		char *line)
+{
+    while (nlp) {
+	int i;
+
+	/*
+	 * test to see if mlp and nlp have something in commons
+	 * nlp cannot be less than mlp (we check pattern length before)
+	 * but word pattern may of course be shorter than line ...
+	 */
+	for (i = 0; i < 256; i++)
+	    if (mlp->tab[i] && nlp->tab[i]) {
+		/* for equiv. class they must also match word pattern */
+		if (mlp->equiv) {
+		    if (!mwp || !nwp || (mlp->tab[i] == mwp->tab[i] &&
+			nlp->tab[i] == nwp->tab[i]))
+			break;
+		} else
+		    break;
+	    }
+	if (i < 256) {
+	    /* OK we found character that matches both matchers */
+	    *line++ = (char)i;
+	} else {
+	    /* No matching character */
+	    return 0;
+	}
+	mlp = mlp->next;
+	nlp = nlp->next;
+	if (mwp) mwp = mwp->next;
+	if (nwp) nwp = nwp->next;
+    }
+
+    return 1;
+}
+
+/* This tries to find out, if there is common line that may match two
+ * words (possible matches or parts thereof). When this function is called,
+ * it is ensured that `mword' has matched word pattern in `matcher';
+ * we try to find a string that both matches line pattern in `matcher'
+ * and another word `word'
  *
- * wpat contains pattern that matched previously
- * lpat contains the pattern for line we build
- * mword is a string that matched wpat before
- * word is string that we try to match now
+ * matcher - matcher that `mword' has been matched against
+ * line    - buffer for string we build
+ * lmatch  - length of line built so far
+ * wmatch  - length of word matching the above
+ * mword   - word that has matched word pattern in `matcher' before
+ * word    - is string that we try to match now
+ * wlen    - length of `word'
+ * sfx     - if we should match bacwards
  *
- * The return value is the length of the string matched in the word, it
+ * The return value is the length of the string matched in the `word', it
  * is zero if we couldn't build a line that matches the word.
  */
 
-
 /**/
 static int
-bld_line(Cpattern wpat, Cpattern lpat, char *line, char *lp,
+bld_line(Cmatcher matcher,  char *line, int lmatch, int wmatch,
 	 char *mword, char *word, int wlen, int sfx)
 {
-    if (lpat) {
-	/* Still working on the pattern. */
-
-	int i, l;
-	unsigned char c = 0;
+    static Cpattern *mlpa;
+    static Cpattern *mwpa;
+    Cmlist ms;
+    Cmatcher mp;
+    Cpattern pat;
+    char *lp;
+    int llen = matcher->llen, rl = 0, i, clpos, cwpos;
 
-	/* Get the number of the character for a correspondence class
-	 * if it has a corresponding class. */
-	if (lpat->equiv)
-	    if (wpat && *mword) {
-		c = wpat->tab[STOUC(*mword)];
-		wpat = wpat->next;
-		mword++;
-	    }
+    if (!lmatch) {
+	/*
+	 * Setup array instead of list; this is required for suffix matchs
+	 * Array size is that of line; we are not interested in patterns
+	 * above this length
+	 */
+	mlpa = zalloc(sizeof(Cpattern) * llen);
+	mwpa = zalloc(sizeof(Cpattern) * llen);
 
+	for (i = 0, pat = matcher->line; pat; i++, pat = pat->next)
+	    mlpa[i] = pat;
+	for (i = 0, pat = matcher->word; pat && i < llen; i++, pat = pat->next)
+	    mwpa[i] = pat;
+	/* In case mword is smaller than line ... */
+	for (i = matcher->wlen; i < llen; i++)
+	    mwpa[i] = NULL;
+    }
 
-	/* Walk through the table in the pattern and try the characters
-	 * that may appear in the current position. */
-	for (i = 0; i < 256; i++)
-	    if ((lpat->equiv && c) ? (c == lpat->tab[i]) : lpat->tab[i]) {
-		*lp = i;
-		/* We stored the character, now call ourselves to build
-		 * the rest. */
-		if ((l = bld_line(wpat, lpat->next, line, lp + 1,
-				  mword, word, wlen, sfx)))
-		    return l;
-	    }
+    if (sfx) {
+	lp = line + llen - lmatch;
+	clpos = llen - lmatch - 1;
+	cwpos = llen - wmatch - 1;
     } else {
-	/* We reached the end, i.e. the line string is fully build, now
-	 * see if it matches the given word. */
-
-	Cmlist ms;
-	Cmatcher mp;
-	int l = lp - line, t, rl = 0, ind, add;
-
-	/* Quick test if the strings are exactly the same. */
-	if (l == wlen && !strncmp(line, word, l))
-	    return l;
-
-	if (sfx) {
-	    line = lp; word += wlen;
-	    ind = -1; add = -1;
-	} else {
-	    ind = 0; add = 1;
+	lp = line + lmatch;
+	clpos = lmatch;
+	cwpos = wmatch;
+    }
+
+    /*
+     * First try if next character in word fits mword directly;
+     * if yes, try to build the rest of line
+     * Do it only if word is at least as large as line
+     */
+    if (wmatch + 1 <= llen &&
+	cm_try_single(mlpa[clpos], word[clpos],
+		      mwpa[clpos], mword[clpos])) {
+	/* last character? */
+	if (lmatch + 1 == llen) {
+	    rl = 1;
+	    goto cleanup;
 	}
-	/* We loop through the whole line string built. */
-	while (l && wlen) {
-	    if (word[ind] == line[ind]) {
-		/* The same character in both strings, skip over. */
-		line += add; word += add;
-		l--; wlen--; rl++;
-	    } else {
-		t = 0;
-		for (ms = bmatchers; ms && !t; ms = ms->next) {
-		    mp = ms->matcher;
-		    if (mp && !mp->flags && mp->wlen <= wlen && mp->llen <= l &&
-			pattern_match(mp->line, (sfx ? line - mp->llen : line),
-				      mp->word, (sfx ? word - mp->wlen : word))) {
-			/* Both the line and the word pattern matched,
-			 * now skip over the matched portions. */
-			if (sfx) {
-			    line -= mp->llen; word -= mp->wlen;
-			} else {
-			    line += mp->llen; word += mp->wlen;
-			}
-			l -= mp->llen; wlen -= mp->wlen; rl += mp->wlen;
-			t = 1;
-		    }
-		}
-		if (!t)
-		    /* Didn't match, give up. */
-		    return 0;
+	if ((rl = bld_line(matcher, line, lmatch + 1, wmatch + 1,
+			   mword, word, wlen, sfx)) >= 0) {
+	    rl++;
+	    goto cleanup;
+	}
+    }
+
+    /* nope; try each matcher in turn and see if we can match with it */
+    for (ms = bmatchers; ms; ms = ms->next) {
+	mp = ms->matcher;
+	if (mp && !mp->flags && mp->wlen <= wlen - wmatch && mp->llen <= llen - lmatch &&
+	    pattern_match(mp->word, (sfx ? word - wmatch - mp->wlen : word + wmatch),
+			  NULL, NULL) &&
+	    pattern_compare(mlpa[clpos], mwpa[clpos], mp->line, mp->word,
+			    (sfx ? lp - mp->llen : lp))) {
+	    /* Both the line and the word pattern matched,
+	     * now try to match the rest */
+	    rl = bld_line(matcher, line, lmatch + mp->llen, wmatch + mp->wlen,
+			  mword, word, wlen, sfx);
+	    if (rl) {
+		rl += mp->wlen;
+		goto cleanup;
 	    }
 	}
-	if (!l)
-	    /* Unmatched portion in the line built, return matched length. */
-	    return rl;
     }
-    return 0;
+
+    /* Nothing match; return failure */
+
+cleanup:
+    if (!lmatch) {
+	zfree(mlpa, sizeof(Cpattern) * matcher->llen);
+	zfree(mwpa, sizeof(Cpattern) * matcher->wlen);
+    }
+
+    return rl;
 }
 
 /* This builds a string that may be put on the line that fully matches the
@@ -1357,7 +1434,7 @@ join_strs(int la, char *sa, int lb, char
 			}
 			/* Now try to build a string that matches the other
 			 * string. */
-			if ((bl = bld_line(mp->word, mp->line, line, line,
+			if ((bl = bld_line(mp, line, 0, 0,
 					   *ap, *bp, *blp, 0))) {
 			    /* Found one, put it into the return string. */
 			    line[mp->llen] = '\0';
@@ -1560,7 +1637,7 @@ join_sub(Cmdata md, char *str, int len, 
 		    else
 			mw = nw - (sfx ? mp->wlen : 0);
 
-		    if ((bl = bld_line(mp->word, mp->line, line, line,
+		    if ((bl = bld_line(mp, line, 0, 0,
 				       mw, (t ? nw : ow), (t ? nl : ol), sfx)))  {
 			/* Yep, one of the lines matched the other
 			 * string. */

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2006-10-11 17:55 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-09-27 12:11 Stuff to do Peter Stephenson
2006-09-27 13:09 ` DervishD
2006-09-29  3:20   ` Bart Schaefer
2006-09-29  8:05     ` DervishD
2006-09-29 16:37 ` Andrey Borzenkov
2006-09-29 17:08   ` Peter Stephenson
2006-09-29 18:08     ` Andrey Borzenkov
2006-09-29 18:08 ` Andrey Borzenkov
2006-10-08 15:38 ` quest for bld_line (was: Re: Stuff to do) Andrey Borzenkov
2006-10-09 12:00   ` Peter Stephenson
2006-10-09 16:28     ` Andrey Borzenkov
2006-10-11 17:54       ` Andrey Borzenkov

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/zsh/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).