On Wednesday 27 September 2006 16:11, Peter Stephenson wrote:
> - The matcher specifications in completion don't handle multibyte
>   characters and are currently written in such a way as to make this
>   hard (similar to the old suffix character handling).

I have been looking at this recently. So far the following issues have
come up.

1. The matcher code assumes character == byte and uses a 256-byte array
   to build character equivalence classes. What is worse, it passes this
   array around between different functions to supply the results of
   previous matching. I have a patch here (attached) that eliminates the
   external dependency on this array, so the matcher internals can be
   changed more easily. It seems to make the code a bit more
   understandable regardless :) OK to commit?

2. The magic array used for character classes ([abcd]) can naturally be
   superseded by either generic pattern matching or direct comparison.
   Pattern matching allows things like [[:lower:]] and possibly using
   matchers etc., but the potential side effects of extended globbing
   need review. I do not know which is faster. Is it OK?

3. Equivalence classes ({abcd}={xyzw}) do not scale beyond single-byte
   characters. But if we check usage, I believe they have never been
   used for anything beyond case-insensitive matching. For this
   particular usage I suggest a new matcher type:

       m:LPAT>upper
       m:LPAT>lower

   with the obvious semantics: the character from the line is converted
   to upper or lower case and compared with the character from the
   potential match. So m:{a-z}={A-Z} becomes m:?>upper etc. We can still
   implement {...} for a character _set_ but not for a character range.
   So far I do not consider this a major problem.

4. The hardest part: the right anchor. For this the matcher must match
   _backward_. I am not aware of any way to walk backward as long as we
   assume an arbitrary encoding. The options apparently are:

   a) Carefully modify the code to compute the line and pattern lengths
      and try to match from the (llen - plen) point. After all, we know
      that we do not need to match more than that. This may be doable; I
      have not tried yet.

   b) Convert this code to use wide characters. I am not sure this is a
      viable option.

   c) Use UTF-8 :) This is no joke. I expect that nowadays 99% of all
      systems using a multibyte encoding use UTF-8. So we could
      concentrate on the most commonly used case and think about other
      encodings later if anyone really needs them. As soon as we assume
      the input is UTF-8 we immediately get enormous advantages:

      - it can easily be traversed both backwards and forwards (meta
        adds some complexity, but not much)

      - the length of each character is known in advance, so we can
        save mbrtowc() calls

      - if we know that the system uses UCS-4 for wide characters (and
        I guess we do not support others at all), converting to/from
        wide characters can be implemented without calling mbrtowc() &
        Co. at all

      - (remotely) this allows us to get rid of META, which saves quite
        a bit of memory allocation overhead.

Comments?
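
To illustrate the UTF-8 points in 4c, here is a minimal sketch in C. It is not taken from the zsh sources or from the attached patch. All function names are made up for illustration, it assumes valid, unmetafied UTF-8 input, and it ignores META entirely. It shows backward traversal by skipping continuation bytes, getting the character length from the lead byte, and decoding to a code point without mbrtowc().

```c
#include <stddef.h>

/* Hypothetical helpers, not zsh functions. Input is assumed to be
 * valid UTF-8 with no Meta encoding applied. */

/* Nonzero if c is a UTF-8 continuation byte (bit pattern 10xxxxxx). */
int utf8_is_cont(unsigned char c)
{
    return (c & 0xC0) == 0x80;
}

/* Byte length of the character that starts with this lead byte,
 * or 0 if the byte cannot start a character. */
size_t utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80)
        return 1;
    if ((lead & 0xE0) == 0xC0)
        return 2;
    if ((lead & 0xF0) == 0xE0)
        return 3;
    if ((lead & 0xF8) == 0xF0)
        return 4;
    return 0;
}

/* Step back one character: return the start of the character that
 * ends just before p. Never moves before start. */
const char *utf8_prev(const char *start, const char *p)
{
    if (p <= start)
        return start;
    do {
        p--;
    } while (p > start && utf8_is_cont((unsigned char)*p));
    return p;
}

/* Return the start of the last nchars characters of s[0..len),
 * or NULL if the string holds fewer than nchars characters. This is
 * the position a right-anchored match would start from. */
const char *utf8_tail(const char *s, size_t len, size_t nchars)
{
    const char *p = s + len;

    while (nchars > 0) {
        if (p == s)
            return NULL;
        p = utf8_prev(s, p);
        nchars--;
    }
    return p;
}

/* Decode the character at p to a code point without mbrtowc().
 * Returns U+FFFD if p does not point at a lead byte. */
unsigned long utf8_decode(const char *p)
{
    const unsigned char *u = (const unsigned char *)p;
    size_t n = utf8_seq_len(u[0]);
    unsigned long cp;
    size_t i;

    if (n == 0)
        return 0xFFFD;
    if (n == 1)
        return u[0];
    cp = u[0] & (0xFF >> (n + 1));      /* payload bits of the lead byte */
    for (i = 1; i < n; i++)
        cp = (cp << 6) | (u[i] & 0x3F); /* six payload bits per byte */
    return cp;
}
```

With something like utf8_tail(), a right-anchored pattern of N characters could be tried starting N characters from the end of the line, which is roughly the (llen - plen) idea from 4a counted in characters rather than bytes. A real implementation would also have to validate the input and deal with metafied strings.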