* Re: filename completion with umlauts (again)
From: Bart Schaefer @ 2011-01-08 22:22 UTC
To: zsh-workers

[>workers]

On Jan 8, 8:21pm, Peter Stephenson wrote:
}
} The remaining problem is the multibyte one; the matcher code is heavily
} tied to one character per array position in a way that doesn't make it
} easy to turn multibyte into wide characters and back (and that doesn't
} always make it obvious what the @*!@! it's actually doing with the
} array).

"The array" ...

Digging through the list archives I find a reference to "the characters
stored in the matcher are not handled as multibyte" but parse_pattern()
seems to be converting multibyte input to convchar_t so that's not it
any longer. (Is it?)

Hence it must be genpatarr in bld_line(), and the problem is that even
though we can determine correctly that the left-side of the equivalence
class matches the original character on the line, we can't select the
appropriate corresponding character from the right-side of the class?

Which implies that the root of the problem is mb_patmatchindex() in
Src/pattern.c, and what I said before really is true: It's not simple
to expand an "a-z" style representation into an enumeration of all the
characters within the range, figure out that it's the Nth position in
the expansion, and then find the corresponding Nth position in another
range, when either or both ranges might be multibyte; and even if it
were possible to select the correct position in both ranges it's unclear
when to convert the result back to multibyte.

} The collating order might be potentially a problem if you use literal
} characters, but that's already fixed in a general way by allowing the
} syntax:
}
}   m:{[:upper:][:lower:]}={[:lower:][:upper:]}

The syntax is supported but the handling doesn't appear to be
special-cased; mb_patmatchindex() does not differ from patmatchindex()
in its handling of PP_UPPER or PP_LOWER and assumes ranges are
numerically contiguous.

What is it that I continue to fail to see?

BTW in the comments before compmatch.c:pattern_match_restrict() there's
a reference to "s will be NULL" but there is no variable or argument "s".
I suspect it must mean "wsc".
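
The index-mapping difficulty described in this message can be made concrete with a small sketch. It is not taken from the zsh source: struct wrange and the functions class_index(), class_char_at() and map_equivalent() are hypothetical names. It also assumes each class has already been decoded into ranges of wide characters that are numerically contiguous, which is exactly the assumption questioned above for [:upper:] and [:lower:].

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical type: one inclusive range of wide characters, e.g. a-z. */
struct wrange {
    wchar_t lo;
    wchar_t hi;
};

/* Return the 0-based position of c in the enumeration of all ranges
 * of the class, or -1 if c is not a member of the class. */
long class_index(const struct wrange *cls, size_t n, wchar_t c)
{
    long base = 0;

    for (size_t i = 0; i < n; i++) {
        assert(cls[i].lo <= cls[i].hi);
        if (c >= cls[i].lo && c <= cls[i].hi)
            return base + (long)(c - cls[i].lo);
        base += (long)(cls[i].hi - cls[i].lo) + 1;
    }
    return -1;
}

/* Store the character at 0-based position idx of the class in *out.
 * Return 1 on success, 0 if idx lies outside the class. */
int class_char_at(const struct wrange *cls, size_t n, long idx, wchar_t *out)
{
    if (idx < 0)
        return 0;
    for (size_t i = 0; i < n; i++) {
        long len = (long)(cls[i].hi - cls[i].lo) + 1;

        if (idx < len) {
            *out = (wchar_t)(cls[i].lo + idx);
            return 1;
        }
        idx -= len;
    }
    return 0;
}

/* Map c from the left class to the character at the same position
 * in the right class.  Return 1 and set *out on success, else 0. */
int map_equivalent(const struct wrange *left, size_t nleft,
                   const struct wrange *right, size_t nright,
                   wchar_t c, wchar_t *out)
{
    return class_char_at(right, nright, class_index(left, nleft, c), out);
}
```

With a left class of a-z plus 0xE4 and a right class of A-Z plus 0xC4 (a-umlaut and A-umlaut, assuming wchar_t holds Unicode code points), map_equivalent() sends q to Q and 0xE4 to 0xC4. The sketch sidesteps the hard parts named in the message: it works only after everything is in wide characters, and it says nothing about when to convert the result back to multibyte.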

* Re: filename completion with umlauts (again)
From: Peter Stephenson @ 2011-01-08 23:21 UTC
To: zsh-workers

On Sat, 08 Jan 2011 14:22:59 -0800
Bart Schaefer <schaefer@brasslantern.com> wrote:
> } The collating order might be potentially a problem if you use literal
> } characters, but that's already fixed in a general way by allowing the
> } syntax:
> }
> }   m:{[:upper:][:lower:]}={[:lower:][:upper:]}
>
> The syntax is supported but the handling doesn't appear to be
> special-cased; mb_patmatchindex() does not differ from patmatchindex()
> in its handling of PP_UPPER or PP_LOWER and assumes ranges are
> numerically contiguous.

The relevant code is in Src/Zle/compmatch.c. (There are some references
to matchers in other parts of the completion code, and there's a little
bit of extra help from the regular expression code but that's fairly
trivial.) Equivalence classes are handled by pattern_match_equivalence().
In every other place equivalence classes are treated identically to
normal character classes.

> What is it that I continue to fail to see?

See any number of while loops over character arrays in compmatch.c; as
one example, the loop at line 529 in match_str(). The various arrays
are simply char *'s and they're not even metafied (if I remember right;
that's how we support 8-bit single byte encodings, by direct
comparison). The place is full of expressions like "w + aoff - aol" and
"l[-(llen + zoff)]".

All these arrays need to refer either to multibyte characters with
appropriate arithmetic using mbsrtowcs() and friends, or need to be
converted to wide characters and back at appropriate points, and in the
latter case we need to convert everything relevant into wide characters
and back again, in some cases potentially losing information since not
everything on the command line is guaranteed to be a multibyte string
corresponding to a valid character in the current locale. (For example,
you can complete a file name containing ISO-8859-1 characters even when
the locale is UTF-8; this should work even though the characters don't
show up properly.)

If you *can* prove it's trivial, of course...

-- 
Peter Stephenson <p.w.stephenson@ntlworld.com>
Web page now at http://homepage.ntlworld.com/p.w.stephenson/
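
One way to read the first option in this message, "multibyte characters with appropriate arithmetic", is sketched here with the standard mbrtowc(), a relative of mbsrtowcs(). count_units() is a hypothetical helper, not zsh code. It steps through a byte string one character at a time in the current locale and keeps any byte that does not decode as a unit of its own, so an ISO-8859-1 byte in a UTF-8 locale is counted instead of being dropped. The caller is assumed to have called setlocale(LC_CTYPE, "") if the user's encoding is wanted.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Count the "units" in the first len bytes of s: each valid multibyte
 * character in the current locale is one unit, and each byte that is
 * not part of a valid character is one unit by itself. */
size_t count_units(const char *s, size_t len)
{
    mbstate_t st;
    size_t units = 0;

    assert(s != NULL || len == 0);
    memset(&st, 0, sizeof st);          /* start in the initial shift state */

    while (len > 0) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, s, len, &st);

        if (n == (size_t)-1 || n == (size_t)-2) {
            /* Invalid or truncated sequence: keep this byte as its
             * own unit and restart decoding from a clean state. */
            memset(&st, 0, sizeof st);
            n = 1;
        } else if (n == 0) {
            /* mbrtowc() reports an embedded NUL as 0; step over it. */
            n = 1;
        }
        s += n;
        len -= n;
        units++;
    }
    return units;
}
```

The point of the sketch is the step size: any index arithmetic of the kind "w + aoff - aol" would have to advance by n bytes per character rather than by one, which is what makes a mechanical conversion of the existing loops awkward.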

* Re: filename completion with umlauts (again)
From: Bart Schaefer @ 2011-01-09 0:48 UTC
To: zsh-workers

On Jan 8, 11:21pm, Peter Stephenson wrote:
}
} > What is it that I continue to fail to see?
}
} See any number of while loops over character arrays in compmatch.c; as
} one example, the loop at line 529 in match_str(). The various arrays
} are simply char *'s and they're not even metafied

Ah, so working my way up from the bottom I still haven't climbed far
enough out of the pit. The stuff that performs comparisons is mostly
MB-ified, but the code that chooses what needs to be compared is not.

} If you *can* prove it's trivial, of course...

Never intended a pretense of *that* claim ... just trying to get on
record what it is that needs looking at.

If we climb even further out of the hole we've dug, then it appears that
for example match_str() is called on [a copy of] the compprefix global
variable, which, along with the compsuffix et al., is happily marched
around with pointer arithmetic all over the completion code, not just in
compmatch.c.

Furthermore there are a bunch of globals declared in compmatch.c that
keep track of various edits (for lack of a better description) that need
to be applied to the command line to cause it to reflect various
possible outcomes of completion. This includes stuff like "oh by the
way I changed all your ~ into x to hide them from tilde expansion, you
need to put them back again."

Would it even be sufficient to metafy around the match_str() entry
point, or is the real problem that the *entire* completion system needs
to stop treating the input line as a (char *)?

In which case we almost may as well start over from scratch.

* Re: filename completion with umlauts (again)
From: Peter Stephenson @ 2011-01-09 16:44 UTC
To: zsh-workers

On Sat, 08 Jan 2011 16:48:14 -0800
Bart Schaefer <schaefer@brasslantern.com> wrote:
> Would it even be sufficient to metafy around the match_str() entry
> point, or is the real problem that the *entire* completion system
> needs to stop treating the input line as a (char *)?
>
> In which case we almost may as well start over from scratch.

I think it is actually relatively localised, though unfortunately that
is still quite a lot of hard-to-understand code. Most of the completion
system uses proper metafied strings and counts characters appropriately.
It's only when we get to the matching stage and inserting the result
into the command line that the boundaries blur a bit. I think apart
from quite a lot of compmatch.c, chunks of compresult.c, and a few bits
elsewhere, it doesn't actually need changing much. However, one of the
problems is deciding quite where the boundaries are.

A start could probably be made by introducing appropriate types so that
we know where a char * is being treated as an array of individual
characters rather than a generic, possibly multibyte, string --- I
forget this every time I stop looking at it.

-- 
Peter Stephenson <p.w.stephenson@ntlworld.com>
Web page now at http://homepage.ntlworld.com/p.w.stephenson/
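
The closing suggestion about types might look something like the following in C. This is only an illustration of the idea, not a proposal drawn from the zsh source: struct mb_string, struct cell_array, to_cell_array() and cell_at() are all hypothetical names. Wrapping the two uses of char * in distinct struct types lets the compiler reject code that indexes a possibly multibyte string as if each byte were one character.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical: a generic, possibly multibyte, string.  Only its
 * length in bytes is known; it must not be indexed per character. */
struct mb_string {
    const char *bytes;
    size_t nbytes;
};

/* Hypothetical: an array with exactly one character per position,
 * the layout the matcher loops assume. */
struct cell_array {
    const char *cells;
    size_t ncells;
};

/* Reinterpret s as a cell array only when that is known to be safe.
 * This sketch accepts pure ASCII input and refuses anything else.
 * Return 1 and fill *out on success, 0 otherwise. */
int to_cell_array(struct mb_string s, struct cell_array *out)
{
    for (size_t i = 0; i < s.nbytes; i++)
        if ((unsigned char)s.bytes[i] >= 0x80)
            return 0;
    out->cells = s.bytes;
    out->ncells = s.nbytes;
    return 1;
}

/* Per-position access is available only on the cell-array type. */
char cell_at(struct cell_array a, size_t i)
{
    assert(i < a.ncells);
    return a.cells[i];
}
```

Passing a struct mb_string to cell_at() is a compile-time error, so every place where the boundary is crossed has to go through an explicit conversion such as to_cell_array(). A real conversion would need a policy for non-ASCII bytes; the ASCII-only check here is just a placeholder for that decision.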