From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Thu,  2 May 2013 16:51:39 +0200
From: tlaronde@polynum.com
To: Fans of the OS Plan 9 from Bell Labs <9fans@9fans.net>
Message-ID: <20130502145139.GB438@polynum.com>
References: <20130502132556.GA2653@polynum.com>
	<257b5f.1257db09.089d.mx@tumtum.plumbweb.net>
Mime-Version: 1.0
In-Reply-To: <257b5f.1257db09.089d.mx@tumtum.plumbweb.net>
User-Agent: Mutt/1.4.2.3i
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable
Subject: Re: [9fans] Octets regexp
Topicbox-Message-UUID: 50b1bc88-ead8-11e9-9d60-3106f5b1d025

On Thu, May 02, 2013 at 09:43:10AM -0400, Tristan wrote:
> > And after some thought, I don't see an obvious reason why the regexp
> > could not be used with bytes strings (so UTF-8 is OK) without trying to
> > match runes (since not every bytes string is a correct UTF-8 sequence).
>
> with octet based regexps, [Þþ] doesn't match þ, but 0xc3, 0xbe and 0x9e
> independently.
>

Regexps have subexpressions, so this could be achieved: the present
functions could even become higher-level ones, calling more basic ones
that deal with bytes (a rune specified by a UTF-8 sequence being replaced
by a subexpression), or even with elements of various sizes (characters,
but with one fixed size for a given processing).
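To make the subexpression idea concrete, here is a minimal sketch in
Python (not the Plan 9 regexp library), using its re module on byte
strings. The helper name class_to_byte_alternation is hypothetical. It
takes the class of Þ and þ (0xDE and 0xFE in ISO-8859-1) from Tristan's
example and rewrites it as an alternation of UTF-8 byte sequences:

```python
import re


def class_to_byte_alternation(runes: str) -> bytes:
    """Hypothetical helper: rewrite a set of runes as a byte-level
    subexpression, one alternative per UTF-8 sequence."""
    alternatives = [re.escape(r.encode("utf-8")) for r in runes]
    return b"(?:" + b"|".join(alternatives) + b")"


text = "þ".encode("utf-8")  # the two octets 0xc3 0xbe

# Naive octet class: the UTF-8 bytes of "Þþ" dropped into brackets.
# It is a set of single octets (0xc3, 0x9e, 0xbe), not of runes.
naive = re.compile(b"[" + "Þþ".encode("utf-8") + b"]")

# Same class expressed as a subexpression over whole byte sequences.
sub = re.compile(class_to_byte_alternation("Þþ"))

naive_first = naive.match(text).group()      # only the first octet
naive_full = naive.fullmatch(text)           # None: the class is one octet wide
sub_full = sub.fullmatch(text)               # matches both octets of þ
naive_stray = naive.fullmatch(b"\x9e")       # a lone continuation byte matches
```

The naive class accepts a stray 0x9e, which is not a valid UTF-8
sequence on its own, while the rewritten subexpression only accepts the
complete sequences.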

Or even a specification à la C: a leading 'L' would mean: treat the
string as UTF-8, that is, match runes. Without it, leave the string
alone.
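A rough sketch of what such a prefix could look like, again in Python
and purely as an illustration. The function compile_spec and the exact
meaning given to the leading 'L' are assumptions of mine, not an
existing interface:

```python
import re


def compile_spec(spec: bytes):
    """Hypothetical: a leading b'L' selects rune (UTF-8) matching;
    without it the pattern and the input are left as plain octets."""
    if spec.startswith(b"L"):
        rune_re = re.compile(spec[1:].decode("utf-8"))

        def match_runes(data: bytes) -> bool:
            try:
                decoded = data.decode("utf-8")
            except UnicodeDecodeError:
                return False  # not a correct UTF-8 sequence
            return rune_re.fullmatch(decoded) is not None

        return match_runes

    octet_re = re.compile(spec)

    def match_octets(data: bytes) -> bool:
        # Any byte string is acceptable input in octet mode.
        return octet_re.fullmatch(data) is not None

    return match_octets


rune_mode = compile_spec("L[Þþ]".encode("utf-8"))
octet_mode = compile_spec("[Þþ]".encode("utf-8"))
```

With this, rune_mode accepts the two octets of þ and rejects a lone
0x9e, while octet_mode does the opposite. A real design would also need
a way to write a pattern that starts with a literal 'L'.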

-- 
        Thierry Laronde <tlaronde +AT+ polynum +dot+ com>
                      http://www.kergis.com/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89  250D 52B1 AE95 6006 F40C