On Wednesday 27 September 2006 16:11, Peter Stephenson wrote:
> - The matcher specifications in completion don't handle multibyte
> characters and are currently written in such a way as to make this
> hard (similar to the old suffix character handling).

I have been looking at it recently. So far the following issues have come up.

1. The matcher code assumes character == byte and uses a 256-byte array to
build character equivalence classes. What is worse, it passes this array
around between different functions to supply the results of previous matching. I
have a patch here (attached) that eliminates the external dependency on this array
so the matcher internals can be changed more easily. This seems to make the code a
bit more understandable regardless :) OK to commit?

2. The use of the magic array for character classes ([abcd]) can naturally be
superseded by either generic pattern matching or direct comparison.
Pattern matching allows something like [[:lower:]] and possibly
using matchers etc., but the potential side effects of extended globbing need
review. I do not know which is faster. Is it OK?

3. Equivalence classes ({abcd}={xyzw}) do not scale beyond single-byte
characters. But if we check usage, I believe it has never been used for
anything beyond case-insensitive matching. For this particular usage I
suggest a new matcher type:

m:LPAT>upper
m:LPAT>lower

with the obvious semantics: the character from the line is converted to lower or
upper case and compared with the character from the potential match. So
m:{a-z}={A-Z} becomes m:?>upper etc.

We can still implement {...} for a character _set_ but not for a character range.
So far I do not consider this a major problem.
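
To make the proposed semantics concrete, here is a minimal sketch of the
comparison in C. It assumes both characters have already been converted to
wchar_t; the names conv_match and case_conv are made up for illustration and
do not exist in the zsh source.

```c
#include <assert.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical names, for illustration only; not taken from the zsh source. */
enum case_conv { CONV_UPPER, CONV_LOWER };

/*
 * Return 1 if the line character lc, after case conversion, equals the
 * character mc from the potential match, and 0 otherwise.
 */
static int conv_match(wchar_t lc, wchar_t mc, enum case_conv conv)
{
    assert(conv == CONV_UPPER || conv == CONV_LOWER);
    wint_t converted = (conv == CONV_UPPER) ? towupper((wint_t)lc)
                                            : towlower((wint_t)lc);
    return converted == (wint_t)mc;
}
```

This covers only the converted comparison: with CONV_UPPER a line character
'a' matches 'A' but not 'a'. Identical characters would presumably still
match through the ordinary, matcher-independent comparison.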

4. The hardest part: the right anchor. For this the matcher must match
_backward_. I am not aware of any way to walk backward as long as we assume an
arbitrary encoding. The options apparently are

a) careful modification of the code to compute the line and pattern lengths and
try to match from the (llen - plen) point. After all, we know that we do not
need to match more than that. This may be doable; I have not tried it yet.

b) convert this code to use wide characters. Not sure if this is a viable 
option.

c) Use UTF-8 :) It is no joke - I expect that nowadays 99% of all systems using
a multibyte encoding use UTF-8. So we may concentrate on the most commonly used
case and think about other encodings later if anyone really needs them. As soon
as we assume the input is in UTF-8 we immediately get enormous advantages:

- it can be easily traversed both backwards and forwards (meta adds some
complexity but not much)
- the length is known in advance, so we can save mbrtowc() calls
- if we know that the system uses UCS-4 for wide characters (and I guess we do
not support others at all), converting to/from wide characters can be
implemented without calling mbrtowc() & Co. at all.
- (remotely) this allows us to get rid of META, which saves quite a bit of
memory allocation overhead.
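
As a rough illustration of the first two points, here is a sketch in C of
walking backward and of reading the sequence length from the first byte. It
assumes valid, unmetafied UTF-8 input; the function names are made up for
illustration and the zsh Meta encoding is ignored entirely.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Length in bytes of the UTF-8 sequence introduced by lead byte c,
 * or 0 if c cannot start a sequence (continuation or invalid byte).
 */
static size_t utf8_seq_len(unsigned char c)
{
    if (c < 0x80)
        return 1;               /* ASCII */
    if (c >= 0xC2 && c <= 0xDF)
        return 2;
    if (c >= 0xE0 && c <= 0xEF)
        return 3;
    if (c >= 0xF0 && c <= 0xF4)
        return 4;
    return 0;
}

/*
 * Given a byte offset pos > 0 that lies on a character boundary in s,
 * return the offset of the start of the previous character.
 * Continuation bytes always have the bit pattern 10xxxxxx, so we
 * step back over them until we reach a lead byte.
 */
static size_t utf8_prev(const char *s, size_t pos)
{
    assert(pos > 0);
    do {
        pos--;
    } while (pos > 0 && ((unsigned char)s[pos] & 0xC0) == 0x80);
    return pos;
}
```

With something like this, a right anchor could be matched by calling
utf8_prev() repeatedly from the end of the line, without any mbrtowc() calls.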

Comments?