From mboxrd@z Thu Jan 1 00:00:00 1970
Mailing-List: contact zsh-workers-help@sunsite.dk; run by ezmlm
Precedence: bulk
X-No-Archive: yes
X-Seq: 22790
From: Andrey Borzenkov
To: zsh-workers@sunsite.dk
Subject: Re: Stuff to do
Date: Fri, 29 Sep 2006 22:08:05 +0400
User-Agent: KMail/1.9.4
References: <200609271211.k8RCBW5N023914@news01.csr.com> <200609292037.17847.arvidjaar@newmail.ru> <20060929180843.3293cffe.pws@csr.com>
In-Reply-To: <20060929180843.3293cffe.pws@csr.com>
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
Message-Id: <200609292208.06526.arvidjaar@newmail.ru>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Friday 29 September 2006 21:08, Peter Stephenson wrote:
>
> Yes, the more the calling conventions are sanitized like this the better
> I like it. The references to external data are one of my worst
> nightmares.
>

OK, I committed it.

> > 2. Usage of magic array for character classes ([abcd]) can be naturally
> > superceded by using either generic pattern matching or direct comparison.
> > Pattern matching provides for using something like [[:lower:]] and
> > possibly using matchers etc but potential side effects of extended
> > globbing need review. I do not know what is faster. Is it OK?
>
> I'd be quite keen on being able to do this by using globbing. I think the
> current uses of matcher specifications are limited enough (sometimes by
> necessity, as we're seeing) that an extension wouldn't be a problem for
> compatibility; however, I don't know how to mix this with the equivalence
> class stuff. It would be quite nice to keep it in one place in pattern.c,
> but I doubt if that's going to work with all the additions we need.
>
> > 3. Equivalence classes ({abcd}={xyzw}) do not scale beyond single byte
> > characters. But if we check usage I believe, it has never been used for
> > anything beyond case-insensitive matching. For this particular usage I
> > suggest using new matcher type:
> >
> > m:LPAT>upper
> > m:LPAT>lower
> >
> > with obvious semantic - character from line is converted to lower or
> > upper and compared with character from potential match. So m:{a-z}={A-Z}
> > becomes m:?>upper etc.
> >
> > We still can implement {...} for character _set_ but not for character
> > range. So far I do not consider it major problem.
>
> I think we'll need to keep it working for ASCII for compatibility,

Yes, that for sure; the idea was actually to check whether the character is
< 256 and reject the pattern otherwise.

> but not
> extending it to other characters is, as you say, not a big problem.
> However, maybe it's not a problem at all; see below.
>
> > 4. The hardest part. Right anchor. For this matcher must match
> > _backward_. I am not aware of any way to walk backward as long as we
> > assume arbitrary encoding. Options apparently are
> >...
> > b) convert this code to use wide characters. Not sure if this is a viable
> > option.
>
> This is the option I was thinking about, and it removes the range problem
> since it extends the ASCII logic in a natural way (it may be system
> dependent, but that's the absolute least of our worries).
>

It does not, unfortunately. For ranges, that is. What you effectively suggest
is to assume that for {a-z}={A-Z} p matches q if p-a == q-A. This is not
actually true even for 8-bit EBCDIC (I won't eat my hat on it though :); for
an arbitrary encoding it is simply meaningless. It fails even for the basic
European plane (because ß has no upper counterpart); and as soon as you move
to the next plane (1xx) it stops working completely. We really need something
more portable; the exact syntax is open to discussion of course, but so far I
like m:?>upper :) It is so zsh-ish obscure ...

> I don't think it's a problem using wide characters locally for the
> comparisons. Indeed, the pattern match code does all its character class
> stuff with wide characters (or kludged wide characters which are just the
> unsigned char values if a multibyte sequence doesn't convert). It doesn't
> really make sense to allow for unconvertible characters in matcher
> comparisons---it's great to be able to insert them on the command line in
> some fashion, but the matcher specs only make sense for characters that are
> convertible.
>
> The worst problem is that we lose the ability to do matching control where
> (say) much of the string is ASCII, and our match rules only use ASCII, but
> there are also characters that don't work in the current locale.
> I don't think this is a big issue and there are possible ways round:
> - partial conversion
> - convert them at this stage to $'\...' sequences instead of later

As long as we are able to distinguish them from a line with a real $'...',
it is fine. Are we?

> - use marked wide characters where we record a byte that can't be converted

That is probably better. This has to be implemented at some point anyway to
allow input of arbitrary data (vared).

> --- any of which could be bolted on later. So I don't think that's a
> showstopper.
>
> I was wondering how much of the code we needed to convert to use wide
> characters, and vaguely came to the conclusion the only reasonable sane way
> was to do it fairly locally within the comparison function(s),

Sure, at some point pattern_match will have to convert to wide characters to
test for properties. It does not solve the l[-1] problem, and the code is
full of them.

> since
> otherwise the interface to the rest of the completion system gets very
> hairy.

Exactly. I tried to convert compmatch.c to wide characters. This itself is a
more or less mechanical task, but it adds an insane amount of converting all
over the other places. This is already slow as is :(

> However, I haven't actually looked at the code again since
> coming to that conclusion.
>
> However, if there's an easy way of doing it by another method, fine. I
> suspect there isn't.

It depends on your definition of easy :)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.5 (GNU/Linux)

iD8DBQFFHWEGR6LMutpd94wRAqqBAJ0b+XY0Fr24mg0LZpY/CxyMnKZgUwCgzRZE
qirOdaNVPC4pk0wivdvEAvY=
=ocDN
-----END PGP SIGNATURE-----
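
P.S. To make the range argument concrete, here is a minimal sketch, not zsh
code. It compares the offset rule (p-a == q-A) with a conversion rule in the
spirit of the proposed m:?>upper. The function names range_equiv and
upper_equiv are hypothetical, and Python's str.upper() on Unicode code points
only stands in for whatever case conversion the shell would actually use on
wide characters.

```python
def range_equiv(p, q, lo_start, up_start):
    """Offset rule: p matches q if p - lo_start == q - up_start."""
    return ord(p) - ord(lo_start) == ord(q) - ord(up_start)


def upper_equiv(p, q):
    """Conversion rule (hypothetical m:?>upper): convert p, then compare."""
    return p.upper() == q


# ASCII: both rules agree that q matches Q.
ascii_offset = range_equiv("q", "Q", "a", "A")
ascii_upper = upper_equiv("q", "Q")

# Latin-1: applying the a-grave/A-grave offset (0x20) to y-diaeresis
# (U+00FF) lands on sharp s (U+00DF); the real capital is U+0178.
offset_target = chr(ord("\u00ff") - (ord("\u00e0") - ord("\u00c0")))
latin1_offset = range_equiv("\u00ff", "\u0178", "\u00e0", "\u00c0")
latin1_upper = upper_equiv("\u00ff", "\u0178")

# Sharp s has no single-character upper counterpart in this model.
sharp_s_upper = "\u00df".upper()

# U+01xx: capitals and smalls alternate, so a range starting at a-macron
# (U+0101) also contains capitals, and the offset rule pairs A-breve
# (U+0102) with a-macron (U+0101), which is an unrelated letter.
ext_offset = range_equiv("\u0102", "\u0101", "\u0101", "\u0100")
ext_upper = upper_equiv("\u0102", "\u0101")

print(ascii_offset, ascii_upper)
print(repr(offset_target), latin1_offset, latin1_upper)
print(repr(sharp_s_upper))
print(ext_offset, ext_upper)
```

The offset rule is only right where both cases happen to sit in two parallel
runs, as in ASCII; the conversion rule does not depend on the layout of the
encoding at all, which is the whole point of m:?>upper.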