From mboxrd@z Thu Jan  1 00:00:00 1970
From: erik quanstrom <quanstro@quanstro.net>
Date: Thu,  2 May 2013 09:44:38 -0400
To: 9fans@9fans.net
Message-ID: <37e6edf11c49568bd52ff0f3a9bdbb71@brasstown.quanstro.net>
In-Reply-To: <20130502132556.GA2653@polynum.com>
References: <20130502123825.GA1975@polynum.com>
	<f0b15ae0cf8846283eb0f5e513ef8684@kw.quanstro.net>
	<20130502132556.GA2653@polynum.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit
Subject: Re: [9fans] Octets regexp
Topicbox-Message-UUID: 5096fbe6-ead8-11e9-9d60-3106f5b1d025

> This is a reflexion made to me by a developer who can use, when
> needed, regexp (ed(1) or sed(1)) on a Unix where they still deal
> with "char" (bytes) to search for a string of bytes in a binary.

i have never needed to do this.  could you provide some motivation
for grepping for a weird byte in an executable?  surely the debugger
is better suited for this.

> And after some thought, I don't see an obvious reason why the regexp
> could not be used with byte strings (so UTF-8 is OK) without trying to
> match runes (since not every byte string is a correct UTF-8 sequence).

because it makes things more complicated and probably worse for the
common case, while not providing any functionality that isn't already
in other tools.

> Corollary: I don't know if there is an UTF-8 sequence that can tell:
> stop interpreting as UTF-8, takes "as is" (except every incorrect
> sequence, problem being to come back from there: if everything is OK "as
> is", what can be interpreted as: "stops raw, restart
> UTF-8"---solution: this is on user level, not low level, and this is in
> the shell explicitly delimiting chunks, like "'" is the only delimiter,
> and every embedded "'" has to be "escaped" by doubling it).

i think you've missed the point of making utf-8 *the* character set.
it's not sometimes the character set.  or only on tuesday.  it's always
the character set.

- erik