From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 21460 invoked from network); 6 Aug 1999 16:19:13 -0000 Received: from sunsite.auc.dk (130.225.51.30) by ns1.primenet.com.au with SMTP; 6 Aug 1999 16:19:13 -0000 Received: (qmail 7493 invoked by alias); 6 Aug 1999 16:19:03 -0000 Mailing-List: contact zsh-workers-help@sunsite.auc.dk; run by ezmlm Precedence: bulk X-No-Archive: yes X-Seq: 7389 Received: (qmail 7485 invoked from network); 6 Aug 1999 16:19:02 -0000 Message-Id: <9908061547.AA37737@ibmth.df.unipi.it> To: zsh-workers@sunsite.auc.dk (Zsh hackers list) Subject: Un-patch: new pattern matching code Date: Fri, 06 Aug 1999 17:47:59 +0200 From: Peter Stephenson I'm just about to make 3.1.6-pws-1 available, and it will contain the new pattern matching code (or a new pattern matching code, if it turns out to be duff we can go back). Since it's quite long I haven't posted it separately, but here is a description. The user-visible changes are supposed to be entirely minimal: the precedence of operators should be more rational, backtracking will work in numeric ranges, and that's pretty much it. I intend to document a few more of these in the manual; I don't think anything there at the moment has become wrong. One think works, which used to in the old days but I inadvertently messed it up because I didn't know about it: patterns like foo~bar~rod are valid, and are equivalent to, but less cumbersome than, foo~(bar|rod). It's based on a real regular expression matcher, so it should be significantly faster, although I haven't done any real testing of speed and I would be glad to hear about any comparisons (or maybe I wouldn't). In particular, it should do much less memory management --- in some cases, such as simple patterns in ${...#...} or [[ ... = ... ]], possibly even none, as there is now a static buffer. It is also possible to copy compiled patterns; just use memcpy and the size member of the struct. Note that the default is to duplicate the static buffer into a heap-allocated chunk of the appropriate size; this is needed for file paths. It now lives in its own file, pattern.c. Originally everything here was going to use unsigned char * to avoid bad comparisons with integers, but then I found that all the tokens are defined in zsh.h as ((char) 0x83) and so on, so that went out of the window and I resorted to char * after updating the whole of the shell to pass unsigned char * down. As you can imagine, that was great fun. Anyway, it now uses the usual STOUC(...) hack, so problems are quite possible if I've missed any. However, it's now been compiled and tested on one machine with unsigned and one with signed characters, so the basics are probably right. There are two cpp definitions available: one is ZSH_PAT_DEBUG, which enables the `patdebug' builtin, which prints out a compiled pattern with option -p and with more than one argument tests the pattern against the strings; this was inherited from Henry Spencer's code, but I thought I might as well leave it. There's no configure code to enable ZSH_PAT_DEBUG, just add it by hand. The other cpp definition is BACKREFERENCES, which enables backreferencing of parentheses in patterns like sed and perl do. I haven't enabled it because I don't know what, if anything, I'm going to do with it. Matching is a little more efficient without because the extra recursive call on open and close parentheses isn't there. (There's code not to use backreferences even if they are compiled in, so that doesn't need to be a performance issue for ordinary globbing.) Anyone who knows any computer science will probably notice that, while it's pretty much a finite state machine, some states are more finite than others. In particularly, backtracking on excluded matches, where in (foo~bar)rod the `bar' only has to match the `foo' and the `rod' is irrelevant, need some extra state recording, and I have retained the old trick that eliminates infinite loops and prevents exponential behaviour when failing to match against things like `(f#o#)#' (the version of perl here goes into an infinite loop when given `(f*o*)*', so it isn't trivial). This has meant leaving holes for pointers in the compiled expression, which ought to work (I've tried to make sure everything is aligned to pointer size anyway) but is probably the most dubious part of the proceedings. Finally, the usual warning: in case you hadn't guessed, bugs here can cause problems with pretty much any part of the shell. Some of the tests to make sure things like globbing flags are saved and restored are a bit complex. Alignment problems should have been handled, but there could be some remaining, particularly on 64-bit architectures which I haven't tried. So if you're planning on doing any real work while I'm away, you might want to stick with 3.1.6. -- Peter Stephenson Tel: +39 050 844536 WWW: http://www.ifh.de/~pws/ Dipartimento di Fisica, Via Buonarroti 2, 56127 Pisa, Italy