Re: Stuff to do - Andrey Borzenkov

From: Andrey Borzenkov <arvidjaar@newmail.ru>
To: zsh-workers@sunsite.dk
Subject: Re: Stuff to do
Date: Fri, 29 Sep 2006 22:08:05 +0400
Message-ID: <200609292208.06526.arvidjaar@newmail.ru>
In-Reply-To: <20060929180843.3293cffe.pws@csr.com>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Friday 29 September 2006 21:08, Peter Stephenson wrote:
>
> Yes, the more the calling conventions are sanitized like this the better
> I like it.  The references to external data are one of my worst
> nightmares.
>

OK, I committed it.

> > 2. Usage of magic array for character classes ([abcd]) can be naturally
> > superseded by using either generic pattern matching or direct comparison.
> > Pattern matching provides for using something like [[:lower:]] and
> > possibly using matchers etc but potential side effects of extended
> > globbing need review. I do not know what is faster. Is it OK?
>
> I'd be quite keen on being able to do this by using globbing.  I think the
> current uses of matcher specifications are limited enough (sometimes by
> necessity, as we're seeing) that an extension wouldn't be a problem for
> compatibility; however, I don't know how to mix this with the equivalence
> class stuff.  It would be quite nice to keep it in one place in pattern.c,
> but I doubt if that's going to work with all the additions we need.
>
> > 3. Equivalence classes ({abcd}={xyzw}) do not scale beyond single-byte
> > characters. But if we check usage, I believe it has never been used for
> > anything beyond case-insensitive matching. For this particular usage I
> > suggest using a new matcher type:
> >
> > m:LPAT>upper
> > m:LPAT>lower
> >
> > with the obvious semantics: the character from the line is converted to
> > lower or upper case and compared with the character from the potential
> > match. So m:{a-z}={A-Z} becomes m:?>upper etc.
> >
> > We can still implement {...} for a character _set_ but not for a character
> > range. So far I do not consider it a major problem.
>
> I think we'll need to keep it working for ASCII for compatibility,

Yes, that's for sure; the idea was actually to check whether the character is
< 256 and reject the pattern otherwise.

> but not 
> extending it to other characters is, as you say, not a big problem.
> However, maybe it's not a problem at all; see below.
>
> > 4. The hardest part: the right anchor. For this, the matcher must match
> > _backward_. I am not aware of any way to walk backward as long as we
> > assume an arbitrary encoding. The options apparently are
> >...
> > b) convert this code to use wide characters. Not sure if this is a viable
> > option.
>
> This is the option I was thinking about, and it removes the range problem
> since it extends the ASCII logic in a natural way (it may be system
> dependent, but that's the absolute least of our worries).
>

Unfortunately it does not, at least not for ranges. What you effectively
suggest is to assume that for {a-z}={A-Z} p matches q if p-a == q-A. This is
not actually true even for 8-bit EBCDIC (I won't eat my hat on it though :);
for an arbitrary encoding it is simply meaningless. It fails even for the
basic European plane (because ß has no uppercase counterpart), and as soon as
you move to the next plane (1xx) it stops working completely.
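
A minimal sketch of the p-a == q-A arithmetic described above, in Python
purely for illustration (zsh itself is C). The function name offset_match
and the chosen ranges are hypothetical; only the offset rule itself comes
from the text. It shows the rule pairing unrelated characters once the
ranges leave ASCII.

```python
def offset_match(p, q, line_start, word_start):
    """Range rule for {line_start-...}={word_start-...}: p matches q
    when both sit at the same offset from the start of their range."""
    return ord(p) - ord(line_start) == ord(q) - ord(word_start)


# ASCII: {a-z}={A-Z} behaves as expected.
ascii_ok = offset_match("k", "K", "a", "A")

# Latin-1: {U+00E0-U+00FF}={U+00C0-U+00DF} pairs y-diaeresis (U+00FF)
# with sharp s (U+00DF), although the upper case of U+00FF is U+0178.
latin1_pair = offset_match("\u00ff", "\u00df", "\u00e0", "\u00c0")

# U+01xx: upper and lower case alternate, so the same rule pairs the
# upper-case U+0102 with the unrelated lower-case U+0101.
latin_ext_pair = offset_match("\u0102", "\u0101", "\u0101", "\u0100")
```

All three calls return True, but only the first one describes a real
upper/lower pair.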

We really need something more portable. The exact syntax is open to
discussion, of course, but so far I like m:?>upper :) It is so zsh-ish
obscure ...
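
The proposed m:?>upper / m:?>lower semantics could be sketched as follows.
This is an assumption-laden illustration in Python, not zsh code: the
names char_matches and line_matches_word are hypothetical, and Python's
str.upper stands in for whatever case conversion the shell would use. A
conversion that yields more than one character (as Python does for ß)
simply never equals a single character here.

```python
# Hypothetical conversion table for the two proposed matcher types.
CONVERSIONS = {"upper": str.upper, "lower": str.lower}


def char_matches(line_ch, word_ch, convert):
    """Convert the character from the line, then compare it with the
    character from the potential match."""
    return CONVERSIONS[convert](line_ch) == word_ch


def line_matches_word(line, word, convert):
    """Apply the rule position by position; lengths must agree."""
    return len(line) == len(word) and all(
        char_matches(l, w, convert) for l, w in zip(line, word)
    )
```

With this rule, line_matches_word("readme", "README", "upper") holds, and
so does the same call for a non-ASCII pair such as "\u00f1" and "\u00d1",
without any range arithmetic.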

> I don't think it's a problem using wide characters locally for the
> comparisons.  Indeed, the pattern match code does all its character class
> stuff with wide characters (or kludged wide characters which are just the
> unsigned char values if a multibyte sequence doesn't convert).  It doesn't
> really make sense to allow for unconvertible characters in matcher
> comparisons---it's great to be able to insert them on the command line in
> some fashion, but the matcher specs only make sense for characters that are
> convertible.
>
> The worst problem is that we lose the ability to do matching control where
> (say) much of the string is ASCII, and our match rules only use ASCII, but
> there are also characters that don't work in the current locale.  I don't
> think this is a big issue and there are possible ways round:
> - partial conversion
> - convert them at this stage to $'\...' sequences instead of later

As long as we are able to distinguish them from a line containing a real
$'...', it is fine. Are we?

> - use marked wide characters where we record a byte that can't be converted

That is probably better. This has to be implemented at some point anyway to
allow input of arbitrary data (vared).
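
One existing realisation of "marked wide characters" is Python's
surrogateescape error handler, used here only as an illustration of the
idea and not as a proposal for zsh. Each unconvertible byte is recorded as
a code point in U+DC80..U+DCFF and can be restored exactly; the helper
names below are hypothetical.

```python
def to_wide(raw, encoding="utf-8"):
    """Decode bytes, marking each unconvertible byte instead of failing."""
    return raw.decode(encoding, errors="surrogateescape")


def from_wide(wide, encoding="utf-8"):
    """Encode back, restoring every marked byte unchanged."""
    return wide.encode(encoding, errors="surrogateescape")


def is_marked(ch):
    """True if ch records a byte that could not be converted."""
    return 0xDC80 <= ord(ch) <= 0xDCFF


def original_byte(ch):
    """Recover the byte value stored in a marked character."""
    return ord(ch) - 0xDC00


# A stray Latin-1 byte 0xE9 followed by a valid UTF-8 sequence for U+00E9.
raw = b"caf\xe9 \xc3\xa9"
wide = to_wide(raw)
marked = [original_byte(ch) for ch in wide if is_marked(ch)]
```

Here wide has six characters, marked holds the single byte 0xE9, and
from_wide(wide) gives back the original bytes, so a matcher could skip or
reject marked characters without losing data.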

> --- any of which could be bolted on later.  So I don't think that's a
> showstopper.
>
> I was wondering how much of the code we needed to convert to use wide
> characters, and vaguely came to the conclusion the only reasonable sane way
> was to do it fairly locally within the comparison function(s), 

Sure, at some point pattern_match will have to convert to wide characters to
test for properties. But that does not solve the l[-1] problem, and the code
is full of such references.

> since 
> otherwise the interface to the rest of the completion system gets very
> hairy.

Exactly. I tried to convert compmatch.c to wide characters. That in itself is
a more or less mechanical task, but it adds an insane amount of conversion
all over the other places. And this is already slow as is :(
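
To make the l[-1] problem concrete, here is a small sketch (Python again,
for illustration only; both function names are hypothetical). In
Shift-JIS the character U+30BD is encoded as the two bytes 0x83 0x5C, and
0x5C on its own is also a valid single-byte character, so stepping back
one byte cannot tell where the last character starts. Indexing backward
is only safe once the whole string has been converted, which is exactly
the conversion cost described above.

```python
def last_unit_bytes(raw):
    """What l[-1] sees on a multibyte string: the final byte only."""
    return raw[-1:]


def last_char_wide(raw, encoding):
    """Convert the whole string first, then step back one character."""
    return raw.decode(encoding)[-1]


line = "\u30bd".encode("shift_jis")      # the bytes 0x83 0x5C
naive = last_unit_bytes(line)            # b"\\", a trail byte
correct = last_char_wide(line, "shift_jis")
```

The naive result is the byte 0x5C, while the converted string yields the
full character U+30BD.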

> However, I haven't actually looked at the code again since 
> coming to that conclusion.
>
> However, if there's an easy way of doing it by another method, fine.  I
> suspect there isn't.

It depends on your definition of easy :)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.5 (GNU/Linux)

iD8DBQFFHWEGR6LMutpd94wRAqqBAJ0b+XY0Fr24mg0LZpY/CxyMnKZgUwCgzRZE
qirOdaNVPC4pk0wivdvEAvY=
=ocDN
-----END PGP SIGNATURE-----
