From mboxrd@z Thu Jan 1 00:00:00 1970
From: erik quanstrom
Date: Thu, 2 May 2013 09:44:38 -0400
To: 9fans@9fans.net
Message-ID: <37e6edf11c49568bd52ff0f3a9bdbb71@brasstown.quanstro.net>
In-Reply-To: <20130502132556.GA2653@polynum.com>
References: <20130502123825.GA1975@polynum.com> <20130502132556.GA2653@polynum.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit
Subject: Re: [9fans] Octets regexp
Topicbox-Message-UUID: 5096fbe6-ead8-11e9-9d60-3106f5b1d025

> This is a reflexion made to me by a developer who can use, when
> needed, regexp (ed(1) or sed(1)) on a Unix where they still deal
> with "char" (bytes) to search for a string of bytes in a binary.

i have never needed to do this.  could you provide some motivation
for grepping for a weird byte in an executable?  surely the debugger
is better suited for this.

> And after some thought, I don't see an obvious reason why the regexp
> could not be used with byte strings (so UTF-8 is OK) without trying to
> match runes (since not every byte string is a correct UTF-8 sequence).

because it makes things more complicated and probably worse for the
common case, while not providing any new functionality that isn't
already in other tools.

> Corollary: I don't know if there is a UTF-8 sequence that can tell:
> stop interpreting as UTF-8, take "as is" (except every incorrect
> sequence, problem being to come back from there: if everything is OK "as
> is", what can be interpreted as: "stops raw, restart
> UTF-8"---solution: this is on user level, not low level, and this is in
> the shell explicitly delimiting chunks, like "'" is the only delimiter,
> and every embedded "'" has to be "escaped" by doubling it).

i think you've missed the point of making utf-8 *the* character set.
it's not sometimes the character set.  or only on tuesday.  it's always
the character set.

- erik