From mboxrd@z Thu Jan  1 00:00:00 1970
From: erik quanstrom <quanstro@quanstro.net>
Date: Thu,  2 May 2013 08:48:06 -0400
To: 9fans@9fans.net
Message-ID: <f0b15ae0cf8846283eb0f5e513ef8684@kw.quanstro.net>
In-Reply-To: <20130502123825.GA1975@polynum.com>
References: <20130502123825.GA1975@polynum.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit
Subject: Re: [9fans] Octets regexp
Topicbox-Message-UUID: 505bc13e-ead8-11e9-9d60-3106f5b1d025

> Regexp(6) handles "characters" that are runes.

perhaps the man page is misleading.  rune in this context means a
character encoded as utf-8.  see regexp(2).  all the functions take char*s.
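
to make the rune/octet distinction concrete, here is a rough sketch.
it is python, not plan 9 c, and says nothing about how regexp(2) is
implemented; the variable names and sample strings are made up for
illustration.  python's re module happens to offer both modes: a str
pattern matches by character, a bytes pattern matches by octet.

```python
import re

# str: a sequence of code points (runes).  "caf\u00e9" ends in e-acute.
text = "caf\u00e9"
# bytes: the same text as utf-8 octets; e-acute becomes 0xc3 0xa9.
data = text.encode("utf-8")

# rune-oriented matching: '.' consumes one whole character.
rune_hit = re.search(r"caf.", text).group()

# octet-oriented matching: '.' consumes a single byte, so the match
# stops halfway through the two-byte encoding of e-acute.
octet_hit = re.search(rb"caf.", data).group()

# a bytes pattern also works on data that is not valid utf-8 at all.
blob = b"\x00\x01\xff\xfeabc"   # made-up binary sample
bin_at = re.search(rb"\xff\xfe", blob).start()

print(repr(rune_hit), repr(octet_hit), bin_at)
```

the octet match splitting a character in two is the usual argument for
matching utf-8 text by rune, and keeping byte matching for data that
really is binary.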

> I wonder if Plan9 developers, when trying to design a way towards some
> localization, have ever thought of bytes (octets) regexp, that is using
> regexp with not rune but octets strings (maybe UTF-8 as is) allowing to
> use regexp with binary too, not only newline terminated chunks etc.?

one of the points of plan 9 was to standardize on one character set,
utf-8.  imho, localization and character set aren't related unless one
is dealing with 8859-x overlays or some other character set insufficient
to represent the range of languages.

however, sam and acme allow for structural regular expressions,
and are generally not line oriented:

http://doc.cat-v.org/bell_labs/structural_regexps/se.pdf

and iirc, cinap has written a cifs bit that uses a bit of binary matching.

- erik