Message-ID: <26E0B9AD31488853D40C701E19370A2D@eigenstate.org>
To: 9front@9front.org
CC: k0ga@shike2.com
Date: Fri, 25 Aug 2023 17:46:02 -0400
From: ori@eigenstate.org
List-ID: <9front.9front.org>
Subject: [9front] sed: fix moving '^' match
Reply-To: 9front@9front.org

Currently, if you do something like:

	echo aabbccd | sed s/^..//g

it will output simply:

	'd'

The start-of-line match moving around as the replacement
progresses is unexpected, and inconsistent with the way
that other sed implementations behave.

This happens because in our sed, we process substitutions
match by match, applying the substitution as we go; normally,
this is unobservable, and the substitution looks atomic,
as regexp matches never look back; the one exception is
the '^' operator, which checks if the current char is at
the start of the string or follows a newline.

This patch works by adding a dummy character at the start
of the line, so we aren't at the start of a line after the
first sub.

This patch brings us in line with at least openbsd and gnu
sed, as well as reducing the amount of surprise I experience
when I put a 'g' in a match out of habit.

Before this patch:

	echo abc | sed s/^.//g => ''
	echo abc | sed s/.$//g => 'ab'

after:

	echo abc | sed s/^.//g => 'bc'
	echo abc | sed s/.$//g => 'ab'

Is anyone aware of any unexpected side effects that this may
have?

diff 44a2f89a03c370940fa0f4747c2357c73984d653 uncommitted
--- a/sys/src/cmd/sed.c
+++ b/sys/src/cmd/sed.c
@@ -127,9 +127,10 @@
 Rune	*loc2;				/* End of pattern match */
 Rune	seof;				/* Pattern delimiter char */
 
-Rune	linebuf[LBSIZE+1];		/* Input data buffer */
-Rune	*lbend = linebuf+LBSIZE;	/* End of buffer */
-Rune	*spend = linebuf;		/* End of input data */
+Rune	linestor[LBSIZE+1];		/* Input data storage */
+Rune	*linebuf = linestor+1;		/* Input data buffer */
+Rune	*lbend = linestor+LBSIZE;	/* End of buffer */
+Rune	*spend = linestor;		/* End of input data */
 Rune	*cp;				/* Current scan point in linebuf */
 
 Rune	holdsp[LBSIZE+1];		/* Hold buffer */
@@ -187,7 +188,7 @@
 void	fcomp(void);
 long	getrune(void);
 Rune	*gline(Rune *);
-int	match(Reprog *, Rune *);
+int	match(Reprog *, Rune *, int);
 void	newfile(enum PTYPE, char *);
 int 	opendata(void);
 Biobuf	*open_file(char *);
@@ -980,7 +981,7 @@
 			ipc->active = 0;	/* out of range */
 			return ipc->negfl;
 		case A_RE:		/* Check for matching R.E. */
-			if (match(ipc->ad2.rp, linebuf))
+			if (match(ipc->ad2.rp, linebuf, 1))
 				ipc->active = 0;
 			return !ipc->negfl;
 		default:
@@ -1001,7 +1002,7 @@
 			}
 			break;
 		case A_RE:		/* Check R.E. */
-			if (match(ipc->ad1.rp, linebuf)) {
+			if (match(ipc->ad1.rp, linebuf, 1)) {
 				ipc->active = 1;	/* In range */
 				return !ipc->negfl;
 			}
@@ -1013,13 +1014,22 @@
 }
 
 int
-match(Reprog *pattern, Rune *buf)
+match(Reprog *pattern, Rune *buf, int first)
 {
+	Rune *p;
+
 	if (!pattern)
 		return 0;
+	/*
+	 * a regex that replaces the start of a line
+	 * with an empty string moves the location of a
+	 * '^' match, so we need to insert a dummy char
+	 * when we're not on the first match of a line.
+	 */
+	p = first ? linebuf : linestor;
 	subexp[0].rsp = buf;
 	subexp[0].ep = 0;
-	if (rregexec(pattern, linebuf, subexp, MAXSUB) > 0) {
+	if (rregexec(pattern, p, subexp, MAXSUB) > 0) {
 		loc1 = subexp[0].rsp;
 		loc2 = subexp[0].rep;
 		return 1;
@@ -1033,7 +1043,7 @@
 {
 	int len;
 
-	if(!match(ipc->re1, linebuf))
+	if(!match(ipc->re1, linebuf, 1))
 		return 0;
 
 	/*
@@ -1054,7 +1064,7 @@
 			loc2++;		/* bump over 0-length match */
 		if(*loc2 == 0)		/* end of string */
 			break;
-	} while(match(ipc->re1, loc2));
+	} while(match(ipc->re1, loc2, 0));
 	return 1;
 }
 