From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from boeygen.nr.no ([156.116.1.2]) by hawkwind.utcs.toronto.edu with SMTP id <2408>; Thu, 3 Jun 1993 07:17:52 -0400
Received: from nr.no by boeygen.nr.no with SMTP (PP) id <05500-0@boeygen.nr.no>; Thu, 3 Jun 1993 13:17:20 +0200
To: rc@hawkwind.utcs.toronto.edu
Subject: pattern matching
Date: Thu, 3 Jun 1993 07:17:17 -0400
From: Brynjulv Hauksson
Message-ID: <"boeygen.nr.513:03.05.93.11.17.27"@nr.no>

rc has been my login shell for about a year, and I don't understand
how I managed so long using csh.

I've read most of the discussions on this list about things users want
added (and sometimes removed). I've had my own ideas about this from
time to time, and forgotten most of them. There is however one idea I
can't quite get rid of, and which I don't think has been mentioned on
the list. (I'm not sure if the following qualifies as a serious wish
for an rc extension; take it as a suggestion of a possible feature in
some future shell if you wish.)

I think the builtin ~ is very useful, but there are times when I wish
it would do more: leave a record somewhere of *what* matched and *how*
the match succeeded. This "record" should be a list, with one element
for each meta-character or simple string segment that matched, in the
order they occurred in the "subject". Example:

    subject     pattern     record
    -Php4       -?*         ('-' 'P' 'hp4')

I sometimes find myself doing things like:

    ~ $s pattern && var = `{echo $s | sed 'sed pattern'}

This is a bit clumsy:

- I need to specify a pattern twice, and in two different notations,
  since the standard utilities for pattern extraction use regular
  expressions and not "glob"-notation.

- if I really want a list result, this can be surprisingly tricky:

      var = ``($special){echo $s | sed 'sed pattern using $special as a field separator'}

  can sometimes be made to work, but I need to be very careful about
  potential empty strings, and the choice of $special.
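The "record" described above can be sketched outside the shell. The following Python is only an illustration of the idea, not rc's match routine: the names `tokenize` and `match_record` are hypothetical, only `*` and `?` are treated as metacharacters (no character classes, no way to quote a literal `*` or `?`), and the choice to let `*` take the shortest possible match first is an assumption.

```python
from typing import List, Optional


def tokenize(pattern: str) -> List[str]:
    """Split a glob pattern into '*', '?', or runs of literal characters."""
    tokens: List[str] = []
    literal = ""
    for ch in pattern:
        if ch in "*?":
            if literal:
                tokens.append(literal)
                literal = ""
            tokens.append(ch)
        else:
            literal += ch
    if literal:
        tokens.append(literal)
    return tokens


def match_record(subject: str, pattern: str) -> Optional[List[str]]:
    """Return one list element per pattern token, or None if there is no match."""
    tokens = tokenize(pattern)

    def walk(ti: int, si: int) -> Optional[List[str]]:
        # ti indexes the pattern tokens, si indexes the subject string.
        if ti == len(tokens):
            return [] if si == len(subject) else None
        tok = tokens[ti]
        if tok == "?":
            # '?' consumes exactly one character.
            if si < len(subject):
                rest = walk(ti + 1, si + 1)
                if rest is not None:
                    return [subject[si]] + rest
            return None
        if tok == "*":
            # Assumption: try the shortest match first, then backtrack.
            for end in range(si, len(subject) + 1):
                rest = walk(ti + 1, end)
                if rest is not None:
                    return [subject[si:end]] + rest
            return None
        # Literal segment: must appear verbatim at position si.
        if subject.startswith(tok, si):
            rest = walk(ti + 1, si + len(tok))
            if rest is not None:
                return [tok] + rest
        return None

    return walk(0, 0)


print(match_record("-Php4", "-?*"))
```

Note that a `*` which matches nothing still contributes an empty string to the record (for instance `match_record("-P", "-?*")` yields three elements), which is exactly the kind of borderline case a real implementation would have to settle.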

I wrote an external command called `match' which could be used like:

    ; eval var '=(' ``(){match string pattern ...} ')'

using a slightly modified version of the match-routine in the
rc-source (the wanted list result is "almost available" as a side
effect of the ~ command, in the sense that rc's match-routine could
keep track of the needed information easily and cheaply). If string
matches one of the patterns, match succeeds and prints the matching
string to standard output in a format suitable for eval. If there is
no match, it fails and prints nothing.

There are some problems with this - you really need to invent yet
another syntax and semantics for "glob"-pattern matching:

- inserting literal "metacharacters" in the pattern needs a syntax
  different from the one rc uses.

- you need to decide how to handle, and distinguish, various
  borderline cases.

- matching lists against lists is messy

- I still frequently need to specify patterns *twice*, once for
  checking if there was a match (using `~') and once for doing the
  extraction (using `match').

In the end `match' did not turn out to be all that useful, although I
still think "glob"-based pattern extraction could be very useful,
provided it was built into the shell.

Could this facility be grafted on to rc (or an rc-descendant)?
Perhaps:

1) pattern matching in switch- and ~-commands could keep track of how
   they matched, and quietly assign a suitable list to some special
   variable, say `$**', on a successful match. I don't think the cost
   of doing this would be prohibitive, but it would be a feature you'd
   have to pay for, whether you used it or not. (You could possibly
   reduce the cost somewhat by doing bookkeeping and assignment only
   under certain conditions, like if $** was undefined.) Anyway, this
   sort of "magic" variable and quiet side effect has a slightly
   perl-ish flavour, which I'm not sure if I like.

2) you could add a new operator, say `=~', with syntax and semantics
   like a cross between rc's `~' and `=':

       var =~ subject pattern

   could, if it succeeded, assign a list of the result to var.
   Examples:

       ; x =~ -abcdef -?* && whatis x
       x=(- a bcdef)

   (One could conceivably make the existing ~-operator take an
   optional prefixed variable name. I haven't looked into what sort of
   complexities that would introduce into rc's grammar.)

Some examples of usage (assuming the $** hack):

    # crude basename
    fn basename { name = $1 suffix = $2 {
        while (~ $name */*)
            name = $**(3)
        ~ $name * ^ $suffix && name = $**(1)
        echo $name
    }}

    # remove a trailing newline from a string
    ; ~ $str * ^ $nl && str = $**(1)

Potential gain from simple builtin pattern extraction:

- most of *my* uses of sed/awk/expr for pattern based string handling
  would disappear. I could use the same syntax and matching semantics
  as rc uses for matching. Simple pattern extraction would become
  accessible in a very convenient fashion, since I would normally need
  to specify a pattern *once* only.

- you would get a limited builtin string handling capability. It seems
  that most shells, rc included, have builtin facilities for
  concatenating strings, while picking them apart is much harder. This
  seems like an asymmetry to me, which a builtin "glob" pattern
  extraction facility could partially remedy.

- brynjulv
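
For readers who want to try the basename and trailing-newline examples without a modified rc, here is a rough Python sketch of the same logic. It is an illustration under stated assumptions, not the proposed rc feature: `glob_to_regex`, `match_record` and `basename` are hypothetical names, the glob is translated to a regular expression with one capture group per pattern segment, only `*` and `?` are metacharacters, and `*` is greedy here (rc's own choice is not specified in the message). Unlike rc's `* ^ $suffix`, a suffix containing `*` or `?` would be treated as a pattern.

```python
import re
from typing import List, Optional


def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a '*'/'?' glob into a regex with one group per segment."""
    parts: List[str] = []
    literal = ""
    for ch in pattern:
        if ch in "*?":
            if literal:
                parts.append("(" + re.escape(literal) + ")")
                literal = ""
            parts.append("(.*)" if ch == "*" else "(.)")
        else:
            literal += ch
    if literal:
        parts.append("(" + re.escape(literal) + ")")
    # DOTALL so that '*' and '?' can also match newline characters.
    return re.compile("".join(parts), re.DOTALL)


def match_record(subject: str, pattern: str) -> Optional[List[str]]:
    """Return the list of matched segments, or None if there is no match."""
    m = glob_to_regex(pattern).fullmatch(subject)
    return list(m.groups()) if m else None


def basename(name: str, suffix: str = "") -> str:
    """Crude basename: strip leading directories, then an optional suffix."""
    rec = match_record(name, "*/*")
    while rec is not None:
        name = rec[2]  # corresponds to $**(3), the part after the '/'
        rec = match_record(name, "*/*")
    if suffix:
        rec = match_record(name, "*" + suffix)
        if rec is not None:
            name = rec[0]  # corresponds to $**(1), the part before the suffix
    return name


print(basename("/usr/lib/libc.a", ".a"))

# Remove a trailing newline, as in the second rc example.
text = "hello\n"
rec = match_record(text, "*\n")
if rec is not None:
    text = rec[0]
print(repr(text))
```

In both examples the pattern is written once and serves for both the test and the extraction, which is the convenience the message argues for.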