From mboxrd@z Thu Jan 1 00:00:00 1970
To: 9fans@cse.psu.edu
Subject: Re: [9fans] regexp to match paragraphs in troff documents
From: "Russ Cox"
Date: Thu, 7 Jun 2007 13:46:15 -0400
In-Reply-To: <082ffdbcfda12b51637105e95ea479c8@mteege.de>
MIME-Version: 1.0
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit
Message-Id: <20070607174226.379901E8C1C@holo.morphisms.net>
Topicbox-Message-UUID: 7aa1138c-ead2-11e9-9d60-3106f5b1d025

matthias:
> I've tried to pipe paragraphs of a troff document to fmt but I
> have problems with the correct regular expression. My first attempt
> "/^\.[A-Z]++.*\n(^[^]*)*\n^\.[A-Z]++.*\n/" matches only any second
> paragraph because the expression is overlapping. Does anyone have a nice
> idea to match troff paragraphs?

It looks like you'd be happy with

	,y/^\..*\n/ |fmt

or, avoiding fmt of zero-length ranges (just a little faster)

	,y/(^\..*\n)+/ |fmt

rog:
> for other cases, i suppose it might be nice to have non-greedy matching,
> in which case you could do something like:
> 	,x/^\.[A-Z][A-Z].*\n(.*\n)*?\.[A-Z][A-Z]\n/
> russ: how easy do you think it would be to put non-greedy matching into
> the acme/sam regexp engine?

it's trivial but it doesn't make sense.

in plan 9 regular expressions (as in awk), the semantics are that
the leftmost longest overall match is chosen, even if that means
not repeating a * as much as possible.

for example, consider /a*(ab)?/ against "aab".
perl will match "aa" because the a* greedily grabs "aa"
leaving (ab)? no choice but to match the empty string.
plan 9 will match "aab" because that is a longer match:
the a* selflessly matches less so that the overall
expression can match more.

since plan 9 doesn't have the greedy-like-perl * operator,
it doesn't make sense to think about adding a
non-greedy-like-perl * operator.

the y iterator handles about 90% of the reasons
people use non-greedy operators, so i'm happy
to leave things as is.

russ
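
The /a*(ab)?/ example above can be tried outside sam. What follows is only a sketch, and it uses Go's regexp package as a stand-in for both engines rather than the actual sam/acme or perl code: Go's default matching is leftmost-first (perl-like), and calling Longest on a compiled expression switches it to leftmost-longest, which is assumed here to behave like the plan 9 rule for this pattern. The helper names leftmostFirst and leftmostLongest are made up for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// leftmostFirst returns the match picked by Go's default, perl-like
// semantics: each operator is greedy and the first alternative that
// leads to an overall match wins.
func leftmostFirst(pattern, text string) string {
	return regexp.MustCompile(pattern).FindString(text)
}

// leftmostLongest returns the match picked when the engine prefers the
// longest overall match among those starting at the leftmost position.
// This stands in for the plan 9 rule; it is not the sam engine itself.
func leftmostLongest(pattern, text string) string {
	re := regexp.MustCompile(pattern)
	re.Longest()
	return re.FindString(text)
}

func main() {
	const pattern, text = `a*(ab)?`, "aab"
	fmt.Printf("leftmost-first:   %q\n", leftmostFirst(pattern, text))
	fmt.Printf("leftmost-longest: %q\n", leftmostLongest(pattern, text))
}
```

Under the first rule a* takes both a's and (ab)? is left with the empty string, giving "aa". Under the second the overall length decides, so a* takes a single a and (ab)? takes the rest, giving "aab".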