From: Peter Stephenson <pws@csr.com>
To: zsh-workers@sunsite.dk
Subject: Re: Stuff to do
Date: Fri, 29 Sep 2006 18:08:43 +0100
Message-ID: <20060929180843.3293cffe.pws@csr.com>
In-Reply-To: <200609292037.17847.arvidjaar@newmail.ru>

Andrey Borzenkov <arvidjaar@newmail.ru> wrote:
> 1. The matcher code assumes character == byte and uses a 256-byte array to
> build character equivalence classes. What is worse, it passes this array
> around between different functions to supply the results of previous
> matching. I have a patch here (attached) that eliminates the external
> dependency on this array so the matcher internals can be changed more
> easily. This seems to make the code a bit more understandable regardless :)
> OK to commit?

Yes, the more the calling conventions are sanitized like this the better
I like it.  The references to external data are one of my worst
nightmares.

> 2. The use of the magic array for character classes ([abcd]) can naturally
> be superseded by either generic pattern matching or direct comparison.
> Pattern matching allows for something like [[:lower:]] and possibly
> using matchers etc., but the potential side effects of extended globbing
> need review. I do not know which is faster. Is it OK?

I'd be quite keen on being able to do this by using globbing.  I think the
current uses of matcher specifications are limited enough (sometimes by
necessity, as we're seeing) that an extension wouldn't be a problem for
compatibility; however, I don't know how to mix this with the equivalence
class stuff.  It would be quite nice to keep it in one place in pattern.c,
but I doubt if that's going to work with all the additions we need.
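
To make the contrast in point 2 concrete, here is a minimal sketch in Python (not zsh's C code) of a 256-entry byte table next to a predicate-style class test.  The names build_byte_table and in_named_class are hypothetical, and the mapping of class names to Python string methods is only an assumption standing in for whatever pattern.c would do for [[:lower:]] and friends.

```python
def build_byte_table(chars: bytes) -> list:
    """Model of a 'magic array': one boolean slot per possible byte value."""
    table = [False] * 256
    for b in chars:
        table[b] = True
    return table


def in_named_class(ch: str, name: str) -> bool:
    """Hypothetical stand-in for a [[:name:]] test on a single character."""
    tests = {
        "lower": str.islower,
        "upper": str.isupper,
        "alpha": str.isalpha,
        "digit": str.isdigit,
    }
    return tests[name](ch)


# [abcd] fits the byte table because each member is a single byte.
abcd = build_byte_table(b"abcd")

# U+00E9 (e with acute accent) is two bytes in UTF-8, so it has no single
# slot in the table, but a predicate on the character handles it directly.
e_acute = "\u00e9"
utf8_len = len(e_acute.encode("utf-8"))
is_lower = in_named_class(e_acute, "lower")
```

The table works only while one character is one byte; the predicate form does not care how many bytes the character occupies.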

> 3. Equivalence classes ({abcd}={xyzw}) do not scale beyond single-byte
> characters. But if we check usage, I believe it has never been used for
> anything beyond case-insensitive matching. For this particular usage I
> suggest using a new matcher type:
>
> m:LPAT>upper
> m:LPAT>lower
>
> with the obvious semantics: the character from the line is converted to
> lower or upper case and compared with the character from the potential
> match. So m:{a-z}={A-Z} becomes m:?>upper etc.
>
> We can still implement {...} for a character _set_ but not for a character
> range. So far I do not consider this a major problem.

I think we'll need to keep it working for ASCII for compatibility, but not
extending it to other characters is, as you say, not a big problem.
However, maybe it's not a problem at all; see below.
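
The difference between the existing m:{a-z}={A-Z} form and the proposed m:?>upper form can be sketched as follows.  This is a Python model of the two comparisons only, under the assumption that each argument is a single character; equiv_by_table and equiv_by_case are hypothetical names, and the m:LPAT>upper syntax is the proposal quoted above, not an existing zsh feature.

```python
import string


def equiv_by_table(line_ch: str, match_ch: str,
                   left: str = string.ascii_lowercase,
                   right: str = string.ascii_uppercase) -> bool:
    """Model of m:{a-z}={A-Z}: the n-th character of the left class is
    treated as equivalent to the n-th character of the right class."""
    i = left.find(line_ch)
    return i >= 0 and right[i] == match_ch


def equiv_by_case(line_ch: str, match_ch: str, direction: str) -> bool:
    """Model of the proposed m:?>upper / m:?>lower: convert the character
    from the line, then compare it with the character from the match."""
    if direction == "upper":
        converted = line_ch.upper()
    else:
        converted = line_ch.lower()
    return converted == match_ch


# ASCII letters work either way.
ascii_table = equiv_by_table("a", "A")
ascii_case = equiv_by_case("a", "A", "upper")

# U+00E9 / U+00C9 fall outside the a-z range but have a case mapping.
accent_table = equiv_by_table("\u00e9", "\u00c9")
accent_case = equiv_by_case("\u00e9", "\u00c9", "upper")
```

The positional table has to enumerate every pair, while the case-conversion form delegates to whatever case mapping the platform provides.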

> 4. The hardest part: the right anchor. For this the matcher must match
> _backward_. I am not aware of any way to walk backward as long as we assume
> an arbitrary encoding. The options apparently are
>...
> b) convert this code to use wide characters. Not sure if this is a viable
> option.

This is the option I was thinking about, and it removes the range problem
since it extends the ASCII logic in a natural way (it may be system
dependent, but that's the absolute least of our worries).
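
As an illustration of why wide characters help with backward matching: once both strings are sequences of whole characters, stepping back by one character is plain indexing from the end.  The sketch below uses Python str (a sequence of code points) as a stand-in for a wchar_t array; common_suffix_len is a hypothetical helper and does not model the real right-anchor semantics, only the backward walk.

```python
from typing import Callable


def common_suffix_len(line: str, candidate: str,
                      eq: Callable[[str, str], bool] = lambda a, b: a == b) -> int:
    """Walk both character sequences backward from the end and count how
    many trailing characters are equivalent under eq."""
    n = 0
    while (n < len(line) and n < len(candidate)
           and eq(line[-1 - n], candidate[-1 - n])):
        n += 1
    return n


# Exact comparison: only the trailing U+00E9 is shared.
exact = common_suffix_len("caf\u00e9", "th\u00e9")

# Case-insensitive comparison plugged in as the equivalence test.
folded = common_suffix_len("X\u00c9", "y\u00e9",
                           eq=lambda a, b: a.lower() == b.lower())
```

On the raw multibyte strings the same loop would need to know where the previous character starts, which is exactly what cannot be determined for an arbitrary encoding.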

I don't think it's a problem using wide characters locally for the
comparisons.  Indeed, the pattern match code does all its character class
stuff with wide characters (or kludged wide characters which are just the
unsigned char values if a multibyte sequence doesn't convert).  It doesn't
really make sense to allow for unconvertible characters in matcher
comparisons---it's great to be able to insert them on the command line in
some fashion, but the matcher specs only make sense for characters that are
convertible.

The worst problem is that we lose the ability to do matching control where
(say) much of the string is ASCII, and our match rules only use ASCII, but
there are also characters that don't work in the current locale.  I don't
think this is a big issue and there are possible ways round:
- partial conversion
- convert them at this stage to $'\...' sequences instead of later
- use marked wide characters where we record a byte that can't be converted
--- any of which could be bolted on later.  So I don't think that's a
showstopper.
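
The third idea, marked wide characters that record a byte that can't be converted, can be sketched with Python's "surrogateescape" error handler, which maps each undecodable byte to a code point in the range U+DC80 to U+DCFF and restores the original byte on encoding.  This is only an analogy: it is not what zsh does, and to_marked_chars, is_marked and from_marked_chars are hypothetical names.

```python
def to_marked_chars(raw: bytes, encoding: str = "utf-8") -> str:
    """Convert to characters, keeping each unconvertible byte as a marked
    code point instead of failing."""
    return raw.decode(encoding, errors="surrogateescape")


def is_marked(ch: str) -> bool:
    """True if ch stands for a raw byte that could not be converted."""
    return 0xDC80 <= ord(ch) <= 0xDCFF


def from_marked_chars(text: str, encoding: str = "utf-8") -> bytes:
    """Convert back to bytes, restoring the recorded raw bytes."""
    return text.encode(encoding, errors="surrogateescape")


raw = b"abc\xffdef"              # 0xFF is never valid in UTF-8
chars = to_marked_chars(raw)     # seven characters, one of them marked
marked_positions = [i for i, ch in enumerate(chars) if is_marked(ch)]
restored = from_marked_chars(chars)
```

Match rules that only mention ASCII could then still be applied to the convertible characters, with marked ones compared only for identity.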

I was wondering how much of the code we needed to convert to use wide
characters, and vaguely came to the conclusion that the only reasonably
sane way was to do it fairly locally within the comparison function(s), since
otherwise the interface to the rest of the completion system gets very
hairy.  However, I haven't actually looked at the code again since
coming to that conclusion.
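
One way to picture "fairly locally within the comparison function(s)": callers keep passing byte strings and byte offsets, and only the comparison converts a single character on each side, reporting how many bytes it consumed.  This is a Python sketch under the assumption of UTF-8 input; next_char and compare_at are hypothetical names, not functions from the zsh source.

```python
from typing import Callable, Optional, Tuple


def next_char(data: bytes, pos: int) -> Optional[Tuple[str, int]]:
    """Convert the one character starting at byte offset pos.
    Returns (character, byte length), or None if nothing converts."""
    for n in range(1, 5):               # a UTF-8 character is 1 to 4 bytes
        if pos + n > len(data):
            break
        try:
            return data[pos:pos + n].decode("utf-8"), n
        except UnicodeDecodeError:
            continue
    return None


def compare_at(line: bytes, lpos: int, word: bytes, wpos: int,
               eq: Callable[[str, str], bool] = lambda a, b: a == b
               ) -> Optional[Tuple[int, int]]:
    """Compare one character from each byte string.  On success return the
    byte lengths consumed on each side, so the caller's offsets stay in
    bytes; on mismatch or conversion failure return None."""
    a = next_char(line, lpos)
    b = next_char(word, wpos)
    if a is None or b is None or not eq(a[0], b[0]):
        return None
    return a[1], b[1]


# Line has U+00E9, candidate has U+00C9; the rule upper-cases the line side.
result = compare_at(b"\xc3\xa9", 0, b"\xc3\x89", 0,
                    eq=lambda a, b: a.upper() == b)
```

Nothing outside compare_at ever sees a converted character, which is what keeps the interface to the rest of the completion code unchanged.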

However, if there's an easy way of doing it by another method, fine.  I
suspect there isn't.

-- 
Peter Stephenson <pws@csr.com>                  Software Engineer
CSR PLC, Churchill House, Cambridge Business Park, Cowley Road
Cambridge, CB4 0WZ, UK                          Tel: +44 (0)1223 692070



