9front - general discussion about 9front
 help / color / mirror / Atom feed
* [9front] sed: fix moving '^' match
@ 2023-08-25 21:46 ori
  2023-08-26  0:04 ` [9front] " Anthony Martin
  2023-08-26  9:05 ` [9front] " hiro
  0 siblings, 2 replies; 16+ messages in thread
From: ori @ 2023-08-25 21:46 UTC (permalink / raw)
  To: 9front; +Cc: k0ga

Currently, if you do something like:

	echo aabbccd | sed s/^..//g

it will output simply:

	'd'

the start of line match movnig around as the
replacement progresses is unexpected, and
inconsistent with the way that other sed
implementations behave.

This happens because in our sed, we process
substitutions match by match, applying the
substitution as we go; normally, this is
unobservable, and the substitution looks
atomic, as regexp matches never look back;
the one exception is the '^' operator,
which checks if the current char is at the
start of the string or was a newline.

This patch works by adding a dummy character
at the start of the line, so we aren't at
the start of a line after the first sub.

This patch brings us inline with at least
openbsd and gnu sed, as well as reducing
the amount of surprise I experience when
I put a 'g' in a match out of habit.

Before this patch:

	echo abc | sed s/^.//g => ''
	echo abc | sed s/.$//g => 'ab'

after:

	echo abc | sed s/^.//g => 'bc'
	echo abc | sed s/.$//g => 'ab'

anyone aware of any unexpected side effets
that this may have?


diff 44a2f89a03c370940fa0f4747c2357c73984d653 uncommitted
--- a/sys/src/cmd/sed.c
+++ b/sys/src/cmd/sed.c
@@ -127,9 +127,10 @@
 Rune	*loc2;				/* End of pattern match */
 Rune	seof;				/* Pattern delimiter char */
 
-Rune	linebuf[LBSIZE+1];		/* Input data buffer */
-Rune	*lbend = linebuf+LBSIZE;	/* End of buffer */
-Rune	*spend = linebuf;		/* End of input data */
+Rune	linestor[LBSIZE+1];		/* Input data storage */
+Rune	*linebuf = linestor+1;		/* Input data buffer */
+Rune	*lbend = linestor+LBSIZE;	/* End of buffer */
+Rune	*spend = linestor;		/* End of input data */
 Rune	*cp;				/* Current scan point in linebuf */
 
 Rune	holdsp[LBSIZE+1];		/* Hold buffer */
@@ -187,7 +188,7 @@
 void	fcomp(void);
 long	getrune(void);
 Rune	*gline(Rune *);
-int	match(Reprog *, Rune *);
+int	match(Reprog *, Rune *, int);
 void	newfile(enum PTYPE, char *);
 int 	opendata(void);
 Biobuf	*open_file(char *);
@@ -980,7 +981,7 @@
 			ipc->active = 0;	/* out of range */
 			return ipc->negfl;
 		case A_RE:		/* Check for matching R.E. */
-			if (match(ipc->ad2.rp, linebuf))
+			if (match(ipc->ad2.rp, linebuf, 1))
 				ipc->active = 0;
 			return !ipc->negfl;
 		default:
@@ -1001,7 +1002,7 @@
 		}
 		break;
 	case A_RE:			/* Check R.E. */
-		if (match(ipc->ad1.rp, linebuf)) {
+		if (match(ipc->ad1.rp, linebuf, 1)) {
 			ipc->active = 1;	/* In range */
 			return !ipc->negfl;
 		}
@@ -1013,13 +1014,22 @@
 }
 
 int
-match(Reprog *pattern, Rune *buf)
+match(Reprog *pattern, Rune *buf, int first)
 {
+	Rune *p;
+
 	if (!pattern)
 		return 0;
+	/*
+	 * a regex that replaces the the start of a line
+	 * with an empty string moves the location of a
+	 * '^' match, so we need to insert a dummy char
+	 * when we're not on the first match of a line.
+	 */
+	p = first ? linebuf : linestor;
 	subexp[0].rsp = buf;
 	subexp[0].ep = 0;
-	if (rregexec(pattern, linebuf, subexp, MAXSUB) > 0) {
+	if (rregexec(pattern, p, subexp, MAXSUB) > 0) {
 		loc1 = subexp[0].rsp;
 		loc2 = subexp[0].rep;
 		return 1;
@@ -1033,7 +1043,7 @@
 {
 	int len;
 
-	if(!match(ipc->re1, linebuf))
+	if(!match(ipc->re1, linebuf, 1))
 		return 0;
 
 	/*
@@ -1054,7 +1064,7 @@
 				loc2++;		/* bump over 0-length match */
 			if(*loc2 == 0)		/* end of string */
 				break;
-		} while(match(ipc->re1, loc2));
+		} while(match(ipc->re1, loc2, 0));
 	return 1;
 }
 


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [9front] Re: sed: fix moving '^' match
  2023-08-25 21:46 [9front] sed: fix moving '^' match ori
@ 2023-08-26  0:04 ` Anthony Martin
  2023-08-26  9:14   ` hiro
  2023-08-26  9:05 ` [9front] " hiro
  1 sibling, 1 reply; 16+ messages in thread
From: Anthony Martin @ 2023-08-26  0:04 UTC (permalink / raw)
  To: 9front

ori@eigenstate.org once said:
> Currently, if you do something like:
>
> 	echo aabbccd | sed s/^..//g
>
> it will output simply:
>
> 	'd'
>
> [...]
>
> This patch works by adding a dummy character
> at the start of the line, so we aren't at
> the start of a line after the first sub.
>
> [...]
>
> anyone aware of any unexpected side effets
> that this may have?

It looks like a reasonable change but I'd be hesitant to
make it. Ed(1) produces the same output and the behavior
appears to be unchanged since at least the 8th edition.

It's clearly a rare edge case and easily avoided by not using
the g flag when it's not necessary. I doubt it would break
anything, though.

Cheers,
  Anthony

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [9front] sed: fix moving '^' match
  2023-08-25 21:46 [9front] sed: fix moving '^' match ori
  2023-08-26  0:04 ` [9front] " Anthony Martin
@ 2023-08-26  9:05 ` hiro
  2023-08-26  9:41   ` ieliedonge
  1 sibling, 1 reply; 16+ messages in thread
From: hiro @ 2023-08-26  9:05 UTC (permalink / raw)
  To: 9front; +Cc: k0ga

> This patch brings us inline with at least
> openbsd and gnu sed,

being in line with gnu sed is an anti-feature.

> as well as reducing
> the amount of surprise I experience when
> I put a 'g' in a match out of habit.

you deserve to be surprised. you should be thankful to be surprised.
think, before you use g.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [9front] Re: sed: fix moving '^' match
  2023-08-26  0:04 ` [9front] " Anthony Martin
@ 2023-08-26  9:14   ` hiro
  2023-08-26 17:58     ` ori
  2023-08-26 18:02     ` ori
  0 siblings, 2 replies; 16+ messages in thread
From: hiro @ 2023-08-26  9:14 UTC (permalink / raw)
  To: 9front

> It looks like a reasonable change but I'd be hesitant to
> make it. Ed(1) produces the same output and the behavior
> appears to be unchanged since at least the 8th edition.

thanks for checking the 8th edition also. i was wondering about this.

> It's clearly a rare edge case and easily avoided by not using
> the g flag when it's not necessary. I doubt it would break
> anything, though.

it's not just that it's not necessary.
in ori's example it's conflicting commands opposing what ori is trying
to achieve.
but he's trying to achieve something that is trivial to achieve in
regular ways without conflicts.

meanwhile now you're trying to ADD CODE to remove a behavior that
needn't HURT any valid usecase.
the following stops working, which is WRONG. i.e. bsd is wrong, and
GNU is wrong. while Plan9 ed and sed is right as usual.

echo abababacab | sed 's/^ab//g'

PLEASE stop with this move fast and break things

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [9front] sed: fix moving '^' match
  2023-08-26  9:05 ` [9front] " hiro
@ 2023-08-26  9:41   ` ieliedonge
  2023-08-26 16:48     ` hiro
  0 siblings, 1 reply; 16+ messages in thread
From: ieliedonge @ 2023-08-26  9:41 UTC (permalink / raw)
  To: 9front; +Cc: k0ga

> you deserve to be surprised. you should be thankful to be surprised.
> think, before you use g.

Thinking before acting is definitely good.

Current behavior doesn't match what's implied by sed(1) and regexp(6), which
smells like a bug to me. Fixing bugs is also good.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [9front] sed: fix moving '^' match
  2023-08-26  9:41   ` ieliedonge
@ 2023-08-26 16:48     ` hiro
  2023-08-26 18:00       ` ori
  2023-08-26 23:23       ` ieliedonge
  0 siblings, 2 replies; 16+ messages in thread
From: hiro @ 2023-08-26 16:48 UTC (permalink / raw)
  To: 9front; +Cc: k0ga

i disagree, i can interprete sed(1) to allow exactly this feature

On Sat, Aug 26, 2023 at 4:11 PM <ieliedonge@wilsonb.com> wrote:
>
> > you deserve to be surprised. you should be thankful to be surprised.
> > think, before you use g.
>
> Thinking before acting is definitely good.
>
> Current behavior doesn't match what's implied by sed(1) and regexp(6), which
> smells like a bug to me. Fixing bugs is also good.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [9front] Re: sed: fix moving '^' match
  2023-08-26  9:14   ` hiro
@ 2023-08-26 17:58     ` ori
  2023-08-26 18:02     ` ori
  1 sibling, 0 replies; 16+ messages in thread
From: ori @ 2023-08-26 17:58 UTC (permalink / raw)
  To: 9front

Quoth hiro <23hiro@gmail.com>:
> meanwhile now you're trying to ADD CODE to remove a behavior that
> needn't HURT any valid usecase.
> the following stops working, which is WRONG. i.e. bsd is wrong, and
> GNU is wrong. while Plan9 ed and sed is right as usual.
> 
> echo abababacab | sed 's/^ab//g'
> 
> PLEASE stop with this move fast and break things

can you explain why moving '^' as the line is processed
is right?


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [9front] sed: fix moving '^' match
  2023-08-26 16:48     ` hiro
@ 2023-08-26 18:00       ` ori
  2023-08-26 23:23       ` ieliedonge
  1 sibling, 0 replies; 16+ messages in thread
From: ori @ 2023-08-26 18:00 UTC (permalink / raw)
  To: 9front; +Cc: k0ga

Quoth hiro <23hiro@gmail.com>:
> i disagree, i can interprete sed(1) to allow exactly this feature

you expect changes to the line to be visible within a single
search/replace operation? can you explain why?


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [9front] Re: sed: fix moving '^' match
  2023-08-26  9:14   ` hiro
  2023-08-26 17:58     ` ori
@ 2023-08-26 18:02     ` ori
  1 sibling, 0 replies; 16+ messages in thread
From: ori @ 2023-08-26 18:02 UTC (permalink / raw)
  To: 9front

Quoth hiro <23hiro@gmail.com>:
> the following stops working, which is WRONG. i.e. bsd is wrong, and
> GNU is wrong. while Plan9 ed and sed is right as usual.
> 
> echo abababacab | sed 's/^ab//g'

can you explain why this should behave differently from:

	echo bacabababab | sed 's/ab$//g'


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [9front] sed: fix moving '^' match
  2023-08-26 16:48     ` hiro
  2023-08-26 18:00       ` ori
@ 2023-08-26 23:23       ` ieliedonge
  2023-08-27  8:33         ` hiro
  1 sibling, 1 reply; 16+ messages in thread
From: ieliedonge @ 2023-08-26 23:23 UTC (permalink / raw)
  To: 9front; +Cc: k0ga

> i disagree, i can interprete sed(1) to allow exactly this feature

You are way more creative than most then. I tend to think that lines only have
a single beginning and that `echo "blah"` writes a single line.

From regexp(6):

    A ^ matches the beginning of a line; $ matches the end of the line.

From sed(1):

    g     Global. Substitute for all non–overlapping instances of the regular expression rather than just the first one.


FWIW, the current 9front behavior is also inconsistent with Unix V7's sed,
which both BSD and GNU implementations are, perhaps ironically, actually more
consistent than us in this particular case:

    $ echo aabbccd | sed 's/^..//g'
    bbccd

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [9front] sed: fix moving '^' match
  2023-08-26 23:23       ` ieliedonge
@ 2023-08-27  8:33         ` hiro
  2023-08-27  9:54           ` tlaronde
  0 siblings, 1 reply; 16+ messages in thread
From: hiro @ 2023-08-27  8:33 UTC (permalink / raw)
  To: 9front; +Cc: k0ga

> You are way more creative than most then.

no. this is about parsing the man page carefully.

>I tend to think that lines only have
> a single beginning and that `echo "blah"` writes a single line.

i wouldn't disagree at all.




we must be disagreeing about some fundamental understanding instead.
it's hard for me to see still where *exactly* we differ.

just to make sure. you are aware of the other basic symmetrically
asymmetric problems that one encounters in everyday life with sed,
right? the asymmetry between input and output...


cpu% echo aa | sed 's/aa/a/g'
a
cpu% echo aaaa | sed 's/aa/a/g'
aa
cpu% echo aaaaaa | sed 's/aa/a/g'
aaa
cpu% echo aaaaaaaa | sed 's/aa/a/g'
aaaa
cpu% echo aaaaaaaaaa | sed 's/aa/a/g'
aaaaa


is this a basic known fact here?!

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [9front] sed: fix moving '^' match
  2023-08-27  8:33         ` hiro
@ 2023-08-27  9:54           ` tlaronde
  2023-08-27 11:19             ` ieliedonge
  2023-09-23  1:35             ` ieliedonge
  0 siblings, 2 replies; 16+ messages in thread
From: tlaronde @ 2023-08-27  9:54 UTC (permalink / raw)
  To: 9front

On Sun, Aug 27, 2023 at 10:33:50AM +0200, hiro wrote:
> 
> we must be disagreeing about some fundamental understanding instead.
> it's hard for me to see still where *exactly* we differ.
> 
> just to make sure. you are aware of the other basic symmetrically
> asymmetric problems that one encounters in everyday life with sed,
> right? the asymmetry between input and output...
> 
> 
> cpu% echo aa | sed 's/aa/a/g'
> a
> cpu% echo aaaa | sed 's/aa/a/g'
> aa

Isn't this exactly a counter-example to your interpretation of the
behavior of '^aa' with "sliding" beginning (^ anchor) of the line?

To be consistent with what you interpret, "echo aaaa | sed 's/aa/a/g'"
should yield: "a"; and, in fact, all strings of more than one "a" should
give "a", since the "new" string, retaken from its new beginning, should
be reparsed.

I think the implicit understanding of 's/^..//g' is that once the first
two characters are discarded, the scanning has advanced in the original
string and hence it is not anymore at the beginning.

Ori example of the lack of symetry of behavior with 's/aa$//g' is to the
point too.

In some sense, this interpretation of 'g' is considering that 'g' makes an
infinite looping of the regex, applied to the resulting string and that should
only end if no modification is made. Seems incorrect to me.

The current behavior may be bug compatible with V8. It remains a bug
IMHO.
-- 
        Thierry Laronde <tlaronde +AT+ polynum +dot+ com>
                     http://www.kergis.com/
                    http://kertex.kergis.com/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89  250D 52B1 AE95 6006 F40C

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [9front] sed: fix moving '^' match
  2023-08-27  9:54           ` tlaronde
@ 2023-08-27 11:19             ` ieliedonge
  2023-09-23  1:35             ` ieliedonge
  1 sibling, 0 replies; 16+ messages in thread
From: ieliedonge @ 2023-08-27 11:19 UTC (permalink / raw)
  To: 9front

> > we must be disagreeing about some fundamental understanding instead.
> > it's hard for me to see still where *exactly* we differ.
>
> ...
>
> In some sense, this interpretation of 'g' is considering that 'g' makes an
> infinite looping of the regex, applied to the resulting string and that should
> only end if no modification is made. Seems incorrect to me.

Oh, nice! I believe you lasered in on the crux here. IIUC, hiro's mental model
seems to be that s///g induces a simple loop over the string where '^' can
match the beginning of string multiple times, while mine (and others'?) is that
g identifies all matches *before replacement*.

I'm not sure what *should* be in this case, but the latter interpretation is
definitely consistent with the 5 or so different regex implementations I
regularly interact with.

> The current behavior may be bug compatible with V8. It remains a bug
> IMHO.

Just gave V8 a spin:

    # echo aabbccd | sed 's/^..//g'
    bbccd

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [9front] sed: fix moving '^' match
  2023-08-27  9:54           ` tlaronde
  2023-08-27 11:19             ` ieliedonge
@ 2023-09-23  1:35             ` ieliedonge
  2023-09-23 10:30               ` hiro
  2023-09-24 15:35               ` ori
  1 sibling, 2 replies; 16+ messages in thread
From: ieliedonge @ 2023-09-23  1:35 UTC (permalink / raw)
  To: 9front

Any update on this?

I believe we confirmed that the patch brings us back into line with V7 and V8,
not to mention every other regex library in common usage. IMHO, merging makes
eminent sense.

Maybe the documentation for ^ should also be updated? I completely agree that
hiro's interpretation is a valid reading.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [9front] sed: fix moving '^' match
  2023-09-23  1:35             ` ieliedonge
@ 2023-09-23 10:30               ` hiro
  2023-09-24 15:35               ` ori
  1 sibling, 0 replies; 16+ messages in thread
From: hiro @ 2023-09-23 10:30 UTC (permalink / raw)
  To: 9front

> I believe we confirmed that the patch brings us back into line with V7 and
> V8,

did we? either way, there's no value in that.

> not to mention every other regex library in common usage.

there's even less value in that.

> IMHO, merging
> makes
> eminent sense.

explain why it's eminent please.


>
> Maybe the documentation for ^ should also be updated? I completely agree
> that
> hiro's interpretation is a valid reading.
>

some rhetorical questions (sad that i have to put a disclaimer about it):

1) will any of these changes increase real-world compatibility? can
you link to the code that would become executable on plan9 as a
result?

2) if not, will this improve any other real world program running on plan9?

3) if 1) or 2) is true, what is the cost of this change?

1)
no

2)
no

3)
it's a non-zero cost

and the benefit: zero.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [9front] sed: fix moving '^' match
  2023-09-23  1:35             ` ieliedonge
  2023-09-23 10:30               ` hiro
@ 2023-09-24 15:35               ` ori
  1 sibling, 0 replies; 16+ messages in thread
From: ori @ 2023-09-24 15:35 UTC (permalink / raw)
  To: 9front

Quoth ieliedonge@wilsonb.com:
> Any update on this?
> 
> I believe we confirmed that the patch brings us back into line with V7 and V8,
> not to mention every other regex library in common usage. IMHO, merging makes
> eminent sense.

I'd probably prefer to do a new revision without the hack; generating into
an output buffer would fix the bug, but would also improve performance on
long lines and make the code easier to read (at the cost of some increased
memory use).

Either way, it's not urgent.

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2023-09-24 15:42 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-08-25 21:46 [9front] sed: fix moving '^' match ori
2023-08-26  0:04 ` [9front] " Anthony Martin
2023-08-26  9:14   ` hiro
2023-08-26 17:58     ` ori
2023-08-26 18:02     ` ori
2023-08-26  9:05 ` [9front] " hiro
2023-08-26  9:41   ` ieliedonge
2023-08-26 16:48     ` hiro
2023-08-26 18:00       ` ori
2023-08-26 23:23       ` ieliedonge
2023-08-27  8:33         ` hiro
2023-08-27  9:54           ` tlaronde
2023-08-27 11:19             ` ieliedonge
2023-09-23  1:35             ` ieliedonge
2023-09-23 10:30               ` hiro
2023-09-24 15:35               ` ori

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).