Re: filename completion with umlauts (again)

* Re: filename completion with umlauts (again)
From: Bart Schaefer @ 2011-01-08 22:22 UTC (permalink / raw)
  To: zsh-workers

[>workers]

On Jan 8,  8:21pm, Peter Stephenson wrote:
}
} The remaining problem is the multibyte one; the matcher code is heavily
} tied to one character per array position in a way that doesn't make it
} easy to turn multibyte into wide characters and back (and that doesn't
} always make it obvious what the @*!@! it's actually doing with the
} array).

"The array" ...

Digging through the list archives I find a reference to "the characters
stored in the matcher are not handled as multibyte" but parse_pattern() 
seems to be converting multibyte input to convchar_t so that's not it
any longer.  (Is it?)

Hence it must be genpatarr in bld_line(), and the problem is that even
though we can determine correctly that the left-side of the equivalence
class matches the original character on the line, we can't select the
appropriate corresponding character from the right-side of the class?

Which implies that the root of the problem is mb_patmatchindex() in
Src/pattern.c, and what I said before really is true:  It's not simple
to expand an "a-z" style representation into an enumeration of all the
characters within the range, figure out that it's the Nth position in
the expansion, and then find the corresponding Nth position in another
range, when either or both ranges might be multibyte; and even if it were
possible to select the correct position in both ranges it's unclear when
to convert the result back to multibyte.
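
To make the "Nth position in one range, Nth position in the other" step concrete, here is a minimal C sketch. It is not the code in Src/pattern.c: the struct wrange type and the wrange_* helpers are hypothetical names invented for illustration. It works purely on wide-character code points and assumes each range is numerically contiguous, which is exactly the assumption questioned above, and it leaves the conversion back to multibyte (for instance with wcrtomb()) as a separate step.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical: one contiguous range of wide characters, hi inclusive. */
struct wrange {
    wchar_t lo;
    wchar_t hi;
};

/*
 * Zero-based position of c within the concatenation of n ranges,
 * or -1 if c falls in none of them.
 */
static long wrange_index(const struct wrange *r, size_t n, wchar_t c)
{
    long base = 0;

    for (size_t i = 0; i < n; i++) {
        assert(r[i].lo <= r[i].hi);
        if (c >= r[i].lo && c <= r[i].hi)
            return base + (long)(c - r[i].lo);
        base += (long)(r[i].hi - r[i].lo) + 1;
    }
    return -1;
}

/* Store the character at position idx in *out; return 1 on success, 0 if out of range. */
static int wrange_at(const struct wrange *r, size_t n, long idx, wchar_t *out)
{
    if (idx < 0)
        return 0;
    for (size_t i = 0; i < n; i++) {
        long len = (long)(r[i].hi - r[i].lo) + 1;

        if (idx < len) {
            *out = (wchar_t)(r[i].lo + idx);
            return 1;
        }
        idx -= len;
    }
    return 0;
}

/* Map c from its position in `from` to the same position in `to`. */
static int wrange_map(const struct wrange *from, size_t nfrom,
                      const struct wrange *to, size_t nto,
                      wchar_t c, wchar_t *out)
{
    return wrange_at(to, nto, wrange_index(from, nfrom, c), out);
}
```

With from = {L'a', L'z'} and to = {L'A', L'Z'}, wrange_map() turns L'q' into L'Q'. The sketch says nothing about classes such as [:upper:] whose members are not contiguous code points, which is where the hard part lies.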

} The collating order might be potentially a problem if you use literal
} characters, but that's already fixed in a general way by allowing the
} syntax:
} 
}   m:{[:upper:][:lower:]}={[:lower:][:upper:]}

The syntax is supported but the handling doesn't appear to be
special-cased; mb_patmatchindex() does not differ from patmatchindex() in its
handling of PP_UPPER or PP_LOWER and assumes ranges are numerically
contiguous.

What is it that I continue to fail to see?

BTW in the comments before compmatch.c:pattern_match_restrict() there's a
reference to "s will be NULL" but there is no variable or argument "s".
I suspect it must mean "wsc".


* Re: filename completion with umlauts (again)
From: Peter Stephenson @ 2011-01-08 23:21 UTC (permalink / raw)
  To: zsh-workers

On Sat, 08 Jan 2011 14:22:59 -0800
Bart Schaefer <schaefer@brasslantern.com> wrote:
> } The collating order might be potentially a problem if you use literal
> } characters, but that's already fixed in a general way by allowing the
> } syntax:
> } 
> }   m:{[:upper:][:lower:]}={[:lower:][:upper:]}
> 
> The syntax is supported but the handling doesn't appear to be
> special-cased; mb_patmatchindex() does not differ from patmatchindex() in its
> handling of PP_UPPER or PP_LOWER and assumes ranges are numerically
> contiguous.

The relevant code is in Src/Zle/compmatch.c.  (There are some references
to matchers in other parts of the completion code, and there's a little
bit of extra help from the regular expression code but that's fairly
trivial.)  Equivalence classes are handled by
pattern_match_equivalence().  In every other place equivalence classes
are treated identically to normal character classes.

> What is it that I continue to fail to see?

See any number of while loops over character arrays in compmatch.c; as
one example, the loop at line 529 in match_str().  The various arrays
are simply char *'s and they're not even metafied (if I remember right;
that's how we support 8-bit single byte encodings, by direct
comparison).  The place is full of expressions like "w + aoff - aol" and
"l[-(llen + zoff)]".  All these arrays need to refer either to multibyte
characters with appropriate arithmetic using mbsrtowcs() and friends, or
need to be converted to wide characters and back at appropriate points,
and in the latter case we need to convert everything relevant into wide
characters and back again, in some cases potentially losing information
since not everything on the command line is guaranteed to be a multibyte
string corresponding to a valid character in the current locale.  (For
example, you can complete a file name containing ISO-8859-1 characters
even when the locale is UTF-8; this should work even though the
characters don't show up properly.)
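
As an illustration of the "multibyte characters with appropriate arithmetic" option, here is a minimal C sketch of walking a char array one character at a time with mbrtowc() while tolerating bytes that are not valid in the current locale. The function name mb_count_lenient is hypothetical and nothing here is taken from compmatch.c; it only shows the shape of loop that would have to replace plain pointer increments.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/*
 * Count the characters in the first len bytes of s.  A byte sequence
 * that is invalid or truncated in the current locale is counted as one
 * character per raw byte, so no input is rejected or lost.
 */
static size_t mb_count_lenient(const char *s, size_t len)
{
    mbstate_t st;
    size_t count = 0;

    assert(s != NULL || len == 0);
    memset(&st, 0, sizeof st);
    while (len > 0) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, s, len, &st);

        if (n == (size_t)-1 || n == (size_t)-2) {
            /* Invalid or incomplete: consume one raw byte, reset the state. */
            memset(&st, 0, sizeof st);
            n = 1;
        } else if (n == 0) {
            /* An embedded NUL still occupies one byte. */
            n = 1;
        }
        s += n;
        len -= n;
        count++;
    }
    return count;
}
```

The step size n is what expressions like "w + aoff - aol" would need to account for: offsets counted in characters and offsets counted in bytes stop being interchangeable as soon as n can exceed 1. The result depends on the locale in effect, so a caller would be expected to have called setlocale() first.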

If you *can* prove it's trivial, of course...

-- 
Peter Stephenson <p.w.stephenson@ntlworld.com>
Web page now at http://homepage.ntlworld.com/p.w.stephenson/


* Re: filename completion with umlauts (again)
From: Bart Schaefer @ 2011-01-09  0:48 UTC (permalink / raw)
  To: zsh-workers

On Jan 8, 11:21pm, Peter Stephenson wrote:
}
} > What is it that I continue to fail to see?
} 
} See any number of while loops over character arrays in compmatch.c; as
} one example, the loop at line 529 in match_str().  The various arrays
} are simply char *'s and they're not even metafied

Ah, so working my way up from the bottom I still haven't climbed far
enough out of the pit.  The stuff that performs comparisons is mostly
MB-ified, but the code that chooses what needs to be compared is not.

} If you *can* prove it's trivial, of course...

Never intended a pretense of *that* claim ... just trying to get on
record what it is that needs looking at.

If we climb even further out of the hole we've dug, then it appears
that for example match_str() is called on [a copy of] the compprefix
global variable, which, along with the compsuffix et al., is happily
marched around with pointer arithmetic all over the completion code,
not just in compmatch.c.

Furthermore there are a bunch of globals declared in compmatch.c that
keep track of various edits (for lack of a better description) that
need to be applied to the command line to cause it to reflect various
possible outcomes of completion.  This includes stuff like "oh by the
way I changed all your ~ into x to hide them from tilde expansion, you
need to put them back again."

Would it even be sufficient to metafy around the match_str() entry
point, or is the real problem that the *entire* completion system
needs to stop treating the input line as a (char *)?

In which case we almost may as well start over from scratch.


* Re: filename completion with umlauts (again)
From: Peter Stephenson @ 2011-01-09 16:44 UTC (permalink / raw)
  To: zsh-workers

On Sat, 08 Jan 2011 16:48:14 -0800
Bart Schaefer <schaefer@brasslantern.com> wrote:
> Would it even be sufficient to metafy around the match_str() entry
> point, or is the real problem that the *entire* completion system
> needs to stop treating the input line as a (char *)?
> 
> In which case we almost may as well start over from scratch.

I think it is actually relatively localised, though unfortunately that
still covers quite a lot of hard-to-understand code.  Most of the completion system uses
proper metafied strings and counts characters appropriately.  It's only
when we get to the matching stage and inserting the result into the
command line that the boundaries blur a bit.  I think apart from quite a
lot of compmatch.c, chunks of compresult.c, and a few bits elsewhere, it
doesn't actually need changing much.  However, one of the problems is
deciding quite where the boundaries are.  A start could probably be made
by introducing appropriate types so that we know where a char * is being
treated as an array of individual characters rather than a generic,
possibly multibyte, string --- I forget this every time I stop looking at it.
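
One way the "appropriate types" idea could look in C is sketched below. The struct names mbstr and chararr and their helpers are hypothetical and do not exist in the zsh source. The point is only that wrapping the two uses of char * in distinct struct types lets the compiler reject code that indexes a possibly multibyte string as if it held one character per position.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical: bytes that may encode multibyte characters; no positional access. */
struct mbstr {
    const char *bytes;
    size_t nbytes;
};

/* Hypothetical: exactly one character per cell, so cells[i] is meaningful. */
struct chararr {
    const char *cells;
    size_t ncells;
};

/* Wrap a NUL-terminated string as a generic, possibly multibyte, string. */
static struct mbstr mbstr_from(const char *s)
{
    struct mbstr m = { s, strlen(s) };
    return m;
}

/* Wrap n bytes that the caller asserts are one character each. */
static struct chararr chararr_from(const char *s, size_t n)
{
    struct chararr a = { s, n };
    return a;
}

/* Positional access is offered only for chararr, never for mbstr. */
static char chararr_at(struct chararr a, size_t i)
{
    assert(i < a.ncells);
    return a.cells[i];
}
```

Passing a struct mbstr to chararr_at() is a compile-time error, so each place where the boundary is crossed has to be written down as an explicit conversion.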

-- 
Peter Stephenson <p.w.stephenson@ntlworld.com>
Web page now at http://homepage.ntlworld.com/p.w.stephenson/
