From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 15677 invoked from network); 19 May 2008 10:34:41 -0000 X-Spam-Checker-Version: SpamAssassin 3.2.4 (2008-01-01) on f.primenet.com.au X-Spam-Level: X-Spam-Status: No, score=-2.4 required=5.0 tests=AWL,BAYES_00 autolearn=ham version=3.2.4 Received: from news.dotsrc.org (HELO a.mx.sunsite.dk) (130.225.247.88) by ns1.primenet.com.au with SMTP; 19 May 2008 10:34:41 -0000 Received-SPF: none (ns1.primenet.com.au: domain at sunsite.dk does not designate permitted sender hosts) Received: (qmail 4865 invoked from network); 19 May 2008 10:34:34 -0000 Received: from sunsite.dk (130.225.247.90) by a.mx.sunsite.dk with SMTP; 19 May 2008 10:34:34 -0000 Received: (qmail 8154 invoked by alias); 19 May 2008 10:34:31 -0000 Mailing-List: contact zsh-workers-help@sunsite.dk; run by ezmlm Precedence: bulk X-No-Archive: yes X-Seq: 25071 Received: (qmail 8139 invoked from network); 19 May 2008 10:34:30 -0000 Received: from bifrost.dotsrc.org (130.225.254.106) by sunsite.dk with SMTP; 19 May 2008 10:34:30 -0000 Received: from cluster-g.mailcontrol.com (cluster-g.mailcontrol.com [85.115.41.190]) by bifrost.dotsrc.org (Postfix) with ESMTP id 9DC5680589A4 for ; Mon, 19 May 2008 12:34:25 +0200 (CEST) Received: from cameurexb01.EUROPE.ROOT.PRI ([62.189.241.200]) by rly21g.srv.mailcontrol.com (MailControl) with ESMTP id m4JAYMVT007640 for ; Mon, 19 May 2008 11:34:23 +0100 Received: from news01 ([10.103.143.38]) by cameurexb01.EUROPE.ROOT.PRI with Microsoft SMTPSVC(6.0.3790.3959); Mon, 19 May 2008 11:34:21 +0100 Date: Mon, 19 May 2008 11:34:21 +0100 From: Peter Stephenson To: zsh-workers@sunsite.dk (Zsh hackers list) Subject: Re: compmatch behaviour Message-ID: <20080519113421.2ded4a42@news01> In-Reply-To: <080518165753.ZM4385@torch.brasslantern.com> References: <10710.1211137299@pws-pc> <080518165753.ZM4385@torch.brasslantern.com> Organization: CSR X-Mailer: Claws Mail 3.3.1 (GTK+ 2.12.5; i386-redhat-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-OriginalArrivalTime: 19 May 2008 10:34:21.0450 (UTC) FILETIME=[E6465EA0:01C8B99B] X-Scanned-By: MailControl A-08-50-03 (www.mailcontrol.com) on 10.71.0.131 X-Virus-Scanned: ClamAV 0.91.2/7153/Mon May 19 02:25:02 2008 on bifrost X-Virus-Status: Clean On Sun, 18 May 2008 16:57:53 -0700 Bart Schaefer wrote: > So if I comprehend your question, it's not that you need help figuring > out what the code is doing. You want help figuring out what it should > do instead, because the space of all possible wide characters is much > to big to brute-force it the way Sven originally did. > > Right? Exactly. Thanks for looking. > There are two situations being handled > simultaneously here, and maybe the first thing to do is to separate > them. The first situation is where wpat is a correspondence class > and we need to select the corresponding position out of lpat. The > second case is where lpat is an equivalence class and we need to try > every possible character in the class at line position *lp. Hmm... terminology first... Sven's "correspondence class" appears to be the one with the "equiv" flag set, i.e. {...}. So here the characters are numbered and we are searching for a particular one. So currently (no generic character sets) we can do this without looping (conceptually---the implementation needs a loop to find the character but only needs to test that one character). However, in my rewrite I want to be able to say "any upper case character" so that it can match the corresponding lower case character. I can count through the characters or [:stuff:] elements taking account of ranges specially... but then I may just have a pattern such as [:upper:], so I still need to match against the line. So this becomes like the other case. (In case that's not clear: in the existing implementation we have "{a-z}" where a is indexed 1, b is 2, etc. etc. Now I have "{[:upper:]}". Here the only index is 1, but it comes along with a pattern matching instruction "match any upper case character", and on the other side of the correpondence "{[:lower:]}" is interpreted specially when we call pattern_match() as "the corresponding lower case character".) > I think the first think that's needed is to change the Cpattern struct > from a dense array indexed by the ascii value of a character, into ... > well, something else, so that it's not necessary to iterate over it > in the first case, and so the iteration is more sparse in the second > case. Of course, that may make pattern_match() more complicated .... I've already done this, that's the easy bit. The two flies in the ointment are this loop and wide character modifications. > I'm not sure there's any way to avoid iterating in the second case. > Besides handling two possible ways of interpreting the character class, > I think (without tracing very far up the call chain) that this is also > constructing the prefix string that's going to be shown to the user in > the event of a common prefix shared by ambiguous matches. It's not > enough to check whether the test character is in the class, you might > have to know *which* character it is in the class, so to speak. (so you can see how this reflects on my "{[:upper:]}" problem for correspondence classes.) Even if iteration is necessary, which is plausible, I can't help thinking there should be a way of iterating over positions in the word/line and trying to match at each point with the pattern, looking to see which character matched, rather than blindly trying characters at each position, which won't work for every upper case character in Unicode. If we can work out how to change it along those lines then I think we've solved it. (This is certainly a different problem from quadratic behaviour; it's more along the lines of the way the ${foo//.../...} code works, or at least I hope it is because that problem is soluble.) -- Peter Stephenson Software Engineer CSR PLC, Churchill House, Cambridge Business Park, Cowley Road Cambridge, CB4 0WZ, UK Tel: +44 (0)1223 692070