compmatch behaviour

zsh-workers
 help / color / mirror / code / Atom feed

* compmatch behaviour
@ 2008-05-18 19:01 Peter Stephenson
  2008-05-18 23:57 ` Bart Schaefer
  2008-05-21 17:28 ` Andrey Borzenkov
  0 siblings, 2 replies; 5+ messages in thread
From: Peter Stephenson @ 2008-05-18 19:01 UTC (permalink / raw)
  To: Zsh hackers list

Plea for help.

I usually face the completion code on my own, but I'm just looking at
trying to update the matching code to make it more general.  (I believe
I'm already relying on a change of Andrey's that simplifies it in one
respect.  Thank you.)

However, there's one bit that's got me stumped, and unfortunately it's
the core of the whole business.  bld_line() in Src/Zle/compmatch.c works
as follows:

- Input a "word pattern" (the test completion) and a "line pattern" (what
  we're matching it against).
- If we haven't yet got to the end of the line pattern
  - If the line pattern is an equivalence class, then for *every* character
    that can match the character in the test word (yes, you read that
    correctly---if we're looking at upper case characters, for example,
    we will try every possible upper case character until it works)
    - set the character in the string from the command line
    - recurse to test with this character in place with the line
      pattern advanced but the same word pattern
    - if it succeeded, return success.
  - If it's not an equivalence class, no problem: only one character
    to try.  Try it (same recursive logic but no nasty loop).
- If we've got to the end of the line pattern (i.e. have recursed to the
  extent where we've got a complete string from the command line,
  - try matching the test completion and the trial word
  - return success or failure.  (This causes the loop above either
    to return success or try with a new character in the equivalence
    class.)

In other words, to detect a match we try every possible character that
could possibly match and see if it does.  This is crazy.  Obviously this
doesn't generalize to larger groups of characters.

I think the basic reason for this is something along the lines of the
following (I realise this isn't particularly coherent but this is the
best I've got for now): because we can have patterns associated with
both the trial string and the word on the command line, we have got
ourselves into a position where the logic is naturally qudratic: both
sides can in principle change and consequently we need to change one
side to see if it can match the other.

The code for bld_line() is in Src/Zle/compmatch.c.  If anyone can see a
way out of this mess, I'd be glad to hear even tentative theories.
Obviously, I will continue to look at it.  For now, however, I'm
stumped.

Another problem is that the match code makes extensive use of lengths,
which need to become character counts, which means that anything that
touches this code needs to use wide characters, which is a lot of
tortuous code.  However, that problem is in principle soluble.  We need
to get the first problem solved.

-- 
Peter Stephenson <p.w.stephenson@ntlworld.com>
Web page now at http://homepage.ntlworld.com/p.w.stephenson/

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: compmatch behaviour
  2008-05-18 19:01 compmatch behaviour Peter Stephenson
@ 2008-05-18 23:57 ` Bart Schaefer
  2008-05-19 10:34   ` Peter Stephenson
  2008-05-21 17:28 ` Andrey Borzenkov
  1 sibling, 1 reply; 5+ messages in thread
From: Bart Schaefer @ 2008-05-18 23:57 UTC (permalink / raw)
  To: Zsh hackers list

On May 18,  8:01pm, Peter Stephenson wrote:
}
} However, there's one bit that's got me stumped, and unfortunately it's
} the core of the whole business.  bld_line() in Src/Zle/compmatch.c works
} as follows:

So if I comprehend your question, it's not that you need help figuring
out what the code is doing.  You want help figuring out what it should
do instead, because the space of all possible wide characters is much
to big to brute-force it the way Sven originally did.

Right?

} ... because we can have patterns associated with both the trial
} string and the word on the command line, we have got ourselves into
} a position where the logic is naturally qudratic: both sides can in
} principle change and consequently we need to change one side to see if
} it can match the other.

I'm not sure this is quite right (so maybe it's just your coherency or
my comprehension that's off).  There are two situations being handled
simultaneously here, and maybe the first thing to do is to separate
them.  The first situation is where wpat is a correspondence class
and we need to select the corresponding position out of lpat.  The
second case is where lpat is an equivalence class and we need to try
every possible character in the class at line position *lp.

The two cases don't actually overlap as far as I can tell -- Sven has
branched the same loop that searches for c in lpat->tab in the first
case, to do double duty as the loop that acts on every character in
lpat->tab in the second case.

Either (but not both) of the two situations could occur at every value
of lp, which is what the recursion is covering.  This is limited by the
length of the line --- there might be an optimization opportunity in
testing sooner whether *(lp + 1) will be the end of line, but the depth
of the recursion is not related to the size of the character class.

I think the first think that's needed is to change the Cpattern struct
from a dense array indexed by the ascii value of a character, into ...
well, something else, so that it's not necessary to iterate over it
in the first case, and so the iteration is more sparse in the second
case.  Of course, that may make pattern_match() more complicated ....

I'm not sure there's any way to avoid iterating in the second case.
Besides handling two possible ways of interpreting the character class,
I think (without tracing very far up the call chain) that this is also
constructing the prefix string that's going to be shown to the user in
the event of a common prefix shared by ambiguous matches.  It's not
enough to check whether the test character is in the class, you might
have to know *which* character it is in the class, so to speak.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: compmatch behaviour
  2008-05-18 23:57 ` Bart Schaefer
@ 2008-05-19 10:34   ` Peter Stephenson
  2008-05-19 15:11     ` Bart Schaefer
  0 siblings, 1 reply; 5+ messages in thread
From: Peter Stephenson @ 2008-05-19 10:34 UTC (permalink / raw)
  To: Zsh hackers list

On Sun, 18 May 2008 16:57:53 -0700
Bart Schaefer <schaefer@brasslantern.com> wrote:
> So if I comprehend your question, it's not that you need help figuring
> out what the code is doing.  You want help figuring out what it should
> do instead, because the space of all possible wide characters is much
> to big to brute-force it the way Sven originally did.
> 
> Right?

Exactly.  Thanks for looking.

> There are two situations being handled
> simultaneously here, and maybe the first thing to do is to separate
> them.  The first situation is where wpat is a correspondence class
> and we need to select the corresponding position out of lpat.  The
> second case is where lpat is an equivalence class and we need to try
> every possible character in the class at line position *lp.

Hmm... terminology first... Sven's "correspondence class" appears to be the
one with the "equiv" flag set, i.e. {...}.  So here the characters are
numbered and we are searching for a particular one.  So currently (no
generic character sets) we can do this without looping (conceptually---the
implementation needs a loop to find the character but only needs to test
that one character).

However, in my rewrite I want to be able to say "any upper case character"
so that it can match the corresponding lower case character.  I can
count through the characters or [:stuff:] elements taking account of
ranges specially... but then I may just have a pattern such as [:upper:],
so I still need to match against the line.  So this becomes like the other
case.

(In case that's not clear: in the existing implementation we have "{a-z}"
where a is indexed 1, b is 2, etc. etc.  Now I have "{[:upper:]}".  Here
the only index is 1, but it comes along with a pattern matching instruction
"match any upper case character", and on the other side of the correpondence
"{[:lower:]}" is interpreted specially when we call pattern_match() as "the
corresponding lower case character".)

> I think the first think that's needed is to change the Cpattern struct
> from a dense array indexed by the ascii value of a character, into ...
> well, something else, so that it's not necessary to iterate over it
> in the first case, and so the iteration is more sparse in the second
> case.  Of course, that may make pattern_match() more complicated ....

I've already done this, that's the easy bit.  The two flies in the ointment
are this loop and wide character modifications.

> I'm not sure there's any way to avoid iterating in the second case.
> Besides handling two possible ways of interpreting the character class,
> I think (without tracing very far up the call chain) that this is also
> constructing the prefix string that's going to be shown to the user in
> the event of a common prefix shared by ambiguous matches.  It's not
> enough to check whether the test character is in the class, you might
> have to know *which* character it is in the class, so to speak.

(so you can see how this reflects on my "{[:upper:]}" problem for
correspondence classes.)

Even if iteration is necessary, which is plausible, I can't help thinking
there should be a way of iterating over positions in the word/line and
trying to match at each point with the pattern, looking to see which
character matched, rather than blindly trying characters at each position,
which won't work for every upper case character in Unicode.  If we can work
out how to change it along those lines then I think we've solved it.

(This is certainly a different problem from quadratic behaviour; it's more
along the lines of the way the ${foo//.../...} code works, or at least I
hope it is because that problem is soluble.)

-- 
Peter Stephenson <pws@csr.com>                  Software Engineer
CSR PLC, Churchill House, Cambridge Business Park, Cowley Road
Cambridge, CB4 0WZ, UK                          Tel: +44 (0)1223 692070

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: compmatch behaviour
  2008-05-19 10:34   ` Peter Stephenson
@ 2008-05-19 15:11     ` Bart Schaefer
  0 siblings, 0 replies; 5+ messages in thread
From: Bart Schaefer @ 2008-05-19 15:11 UTC (permalink / raw)
  To: Zsh hackers list

On May 19, 11:34am, Peter Stephenson wrote:
}
} Bart Schaefer <schaefer@brasslantern.com> wrote:
} > There are two situations being handled
} > simultaneously here, and maybe the first thing to do is to separate
} > them.  The first situation is where wpat is a correspondence class
} > and we need to select the corresponding position out of lpat.  The
} > second case is where lpat is an equivalence class and we need to try
} > every possible character in the class at line position *lp.
} 
} Hmm... terminology first... Sven's "correspondence class" appears
} to be the one with the "equiv" flag set, i.e. {...}. So here the
} characters are numbered and we are searching for a particular one.

Actually Sven has, again, overloaded something with a similar structure
to serve multiple purposes.  There are two possible cases:
(1) lpat->equiv is false OR wpat does not exist: an equivalence class.
    [every existing character position in lpat->tab is tried at *lp]
(2) wpat exists AND lpat->equiv is true: a correspondence class.
    [the character wpat->tab[*mword] must have a position in lpat->tab]

Case (1) also occurs as a degenerate of (2) when there is no character
in wpat for the current character in mword.  I'm not sure why that's
correct.

} However, in my rewrite I want to be able to say "any upper case
} character" so that it can match the corresponding lower case
} character.

If it's only going to match the corresponding lower case character, then
you have [:upper:] in wpat->tab and you need to simulate case (2) above.
If your lookup in lpat->tab returns [:lower:], convert *mword to lower
case and you're done.  I have no idea how you plan to handle something
like [:upper:] mapping to [:digit:], though.  There's a reason Sven
chose to require enumeration: this works more like "tr" than like "sed".
The two classes in case (2) ought to have the same number of values,
because its the positions in each class that have to match up.

The bit you're worried about, though, is when you have [:upper:] in
lpat->tab and either no wpat or no character in wpat->tab for *mword.
Then you need to try all the possible upper case characters.  Sven's
algorithm seems to be to build every possible combination all the way
out to the end of the line and then compare entire words, discarding
non-matches.  I would think it's possible to try matching the prefix
so far, so that you can short-circuit the rest of the process on a
non-match.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: compmatch behaviour
  2008-05-18 19:01 compmatch behaviour Peter Stephenson
  2008-05-18 23:57 ` Bart Schaefer
@ 2008-05-21 17:28 ` Andrey Borzenkov
  1 sibling, 0 replies; 5+ messages in thread
From: Andrey Borzenkov @ 2008-05-21 17:28 UTC (permalink / raw)
  To: zsh-workers

[-- Attachment #1: Type: text/plain, Size: 4351 bytes --]

On Sunday 18 May 2008, Peter Stephenson wrote:
> 
> Plea for help.
> 
> I usually face the completion code on my own, but I'm just looking at
> trying to update the matching code to make it more general.  (I believe
> I'm already relying on a change of Andrey's that simplifies it in one
> respect.  Thank you.)
> 
> However, there's one bit that's got me stumped, and unfortunately it's
> the core of the whole business.  bld_line() in Src/Zle/compmatch.c works
> as follows:
> 
> - Input a "word pattern" (the test completion) and a "line pattern" (what
>   we're matching it against).
> - If we haven't yet got to the end of the line pattern
>   - If the line pattern is an equivalence class,

Nope, this is other way round. Equivalence (actually it is called
correspondence in documentation) class is simply reduced to standard
character class; this condition later returns a single match:

(lpat->equiv && c) ? (c == lpat->tab[i])

The problem we have is *really* with standard character class.

>                                                  then for *every* character 
>     that can match the character in the test word (yes, you read that
>     correctly---if we're looking at upper case characters, for example,
>     we will try every possible upper case character until it works)

Nope, we will check only uppercase variant of character in the commad line

>     - set the character in the string from the command line
>     - recurse to test with this character in place with the line
>       pattern advanced but the same word pattern
>     - if it succeeded, return success.
>   - If it's not an equivalence class, no problem: only one character
>     to try.  Try it (same recursive logic but no nasty loop).
> - If we've got to the end of the line pattern (i.e. have recursed to the
>   extent where we've got a complete string from the command line,
>   - try matching the test completion and the trial word
>   - return success or failure.  (This causes the loop above either
>     to return success or try with a new character in the equivalence
>     class.)
> 
> In other words, to detect a match we try every possible character that
> could possibly match and see if it does.  This is crazy.  Obviously this
> doesn't generalize to larger groups of characters.
> 
> I think the basic reason for this is something along the lines of the
> following (I realise this isn't particularly coherent but this is the
> best I've got for now): because we can have patterns associated with
> both the trial string and the word on the command line, we have got
> ourselves into a position where the logic is naturally qudratic: both
> sides can in principle change and consequently we need to change one
> side to see if it can match the other.
> 
> The code for bld_line() is in Src/Zle/compmatch.c.  If anyone can see a
> way out of this mess, I'd be glad to hear even tentative theories.

I tried to replace this function with something else and failed because I
still do not understand most of this file. Some clue what's going on may
give comment in join_strs() (where one use case for bld_line is):

/* This builds a string that may be put on the line that fully matches the
 * given strings. The return value is NULL if no such string could be built
 * or that string in local static memory, dup it. */

So we *actually* try to *build* line for whatever reasons; and this
perfectly explains what happens in bld_line(). But I failed to grasp full
sequence of transformations that happen in this file so I must admit I
do not understand why and when would we need to do this.

> Obviously, I will continue to look at it.  For now, however, I'm
> stumped.
> 
> Another problem is that the match code makes extensive use of lengths,
> which need to become character counts, which means that anything that
> touches this code needs to use wide characters,

Hmm ... but if we ever want to move to "normal" pattern mathing code it
operates on multibyte not wide characters, does not it? This would imply
quite a lot of coversions; and some functions from compmatch.c are also
used in very unexpected places.

> which is a lot of 
> tortuous code.  However, that problem is in principle soluble.  We need
> to get the first problem solved.
> 



[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2008-05-21 17:29 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-05-18 19:01 compmatch behaviour Peter Stephenson
2008-05-18 23:57 ` Bart Schaefer
2008-05-19 10:34   ` Peter Stephenson
2008-05-19 15:11     ` Bart Schaefer
2008-05-21 17:28 ` Andrey Borzenkov

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/zsh/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).