Re: [9fans] Octets regexp

From: tlaronde@polynum.com
To: Fans of the OS Plan 9 from Bell Labs <9fans@9fans.net>
Subject: Re: [9fans] Octets regexp
Date: Thu,  2 May 2013 15:25:56 +0200
Message-ID: <20130502132556.GA2653@polynum.com>
In-Reply-To: <f0b15ae0cf8846283eb0f5e513ef8684@kw.quanstro.net>

On Thu, May 02, 2013 at 08:48:06AM -0400, erik quanstrom wrote:
> > Regexp(6) handles "characters" that are runes.
>
> perhaps the man page is misleading.  rune in this context means utf-8.
> see regexp(2).  all the functions take char*s.

But the source files deal with runes...

>
> one of the points of plan 9 was to standardize on one character set,
> utf-8.  imho, localization and character set aren't related unless one
> is dealing with 8859-x overlays or some other character set insufficient
> to represent the range of languages.
>

Localization (as "handled" in POSIX, for example) is a mess. So the Plan 9
solution, which still uses octets (UTF-8), makes far more sense, since it
lets the user extend the set of "characters" that can be used in
naming computer objects. But this is just for nicknames: the system
still speaks C/9P.

So it is better, except perhaps for one thing. For me, the system
"speaks" C or even, obviously, "Plan 9" (well: 9P). It does not have
to speak French, Hebrew, etc., or even English! So it takes or gives
bytes, and this is good. The UTF-8 encoding is the main convention
for the user interface, but can it be unset? I mean, can one use a
"raw" window that takes uninterpreted bytes and renders bytes (with
a special "ASCII" font having ASCII plus "0xdd" glyphs or whatever,
using fonts to do what is done with vis(1) on Unices or with od(1)/xd(1)),
without imposing the assumption that the octet string is UTF-8? Can
one create a file by entering bytes, i.e. binary values that yield
incorrect UTF-8 sequences?
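
To make the "raw" rendering idea concrete, here is a minimal sketch in Python (not Plan 9 code). It shows printable ASCII as is and every other octet as a hex escape, roughly in the spirit of vis(1) or xd(1), without ever decoding the input as UTF-8. The names render_raw and is_valid_utf8 are hypothetical, and the choice to also escape the backslash is an assumption made so that the output stays unambiguous.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True only if the whole byte string decodes as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def render_raw(data: bytes) -> str:
    """Render octets with no UTF-8 assumption.

    Printable ASCII (except the backslash) is shown as is; every other
    octet, including the backslash itself, is shown as \\xdd.
    """
    out = []
    for b in data:
        if 0x20 <= b < 0x7F and b != 0x5C:
            out.append(chr(b))
        else:
            out.append("\\x%02x" % b)
    return "".join(out)


# 0xff never appears in valid UTF-8, and 0xc3 followed by a space is a
# truncated two-byte sequence, so this byte string is not UTF-8.
sample = b"abc\xff\xc3 def\n"
print(is_valid_utf8(sample))
print(render_raw(sample))
```

Nothing here depends on the bytes forming runes: the rendering is per octet, which is the property the "raw" window would need.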

This is a remark made to me by a developer who, when
needed, uses regexps (ed(1) or sed(1)) on a Unix where they still deal
with "char" (bytes) to search for a string of bytes in a binary.

And after some thought, I don't see an obvious reason why regexps
could not be used on byte strings (so UTF-8 is OK) without trying to
match runes (since not every byte string is a correct UTF-8 sequence).
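
As an illustration of regexps over octets rather than runes, here is a small sketch using Python's re module with bytes patterns (this is not the Plan 9 regexp library, only an example of the behaviour being asked about). The contents of blob are made up for the example and deliberately include octets that are not valid UTF-8.

```python
import re

# Hypothetical binary content; \xff\xfe and \xc0\x80 are not valid UTF-8.
blob = b"\x7fELF\x02\x01\x01\x00\xff\xfeversion=1.2\x00\xc0\x80tail"

# With a bytes pattern, each pattern element matches one octet, not one rune.
pat = re.compile(rb"version=[0-9]+\.[0-9]+")
m = pat.search(blob)
print(m.start(), m.group())

# Octet classes work too: find every run of octets >= 0x80.
high = re.findall(rb"[\x80-\xff]+", blob)
print(high)
```

A valid UTF-8 string is just a particular byte string, so the same patterns keep working on ordinary text; what is lost is that a class or "." then matches one octet rather than one multi-byte character.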

Corollary: I don't know whether there is a UTF-8 sequence that can say
"stop interpreting as UTF-8, take what follows as is" (other than any
incorrect sequence). The problem is coming back from there: if everything
is OK "as is", what can be interpreted as "stop raw, restart UTF-8"?
A solution: this belongs at user level, not low level, namely in
the shell explicitly delimiting chunks, where "'" is the only delimiter
and every embedded "'" has to be "escaped" by doubling it.
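
The delimiting convention just described (a single quote as the only delimiter, with embedded quotes doubled) can be sketched as follows in Python, operating on bytes so that the chunk content is never interpreted. The function names quote and unquote are hypothetical, and this is only a sketch of the convention, not of any shell's actual parser.

```python
def quote(chunk: bytes) -> bytes:
    """Wrap a raw chunk in single quotes, doubling every embedded quote."""
    return b"'" + chunk.replace(b"'", b"''") + b"'"


def unquote(quoted: bytes) -> bytes:
    """Recover the raw chunk; reject input that is not properly quoted."""
    if len(quoted) < 2 or quoted[:1] != b"'" or quoted[-1:] != b"'":
        raise ValueError("chunk is not delimited by single quotes")
    inner = quoted[1:-1]
    # After removing all doubled quotes, any quote left over is unpaired.
    if b"'" in inner.replace(b"''", b""):
        raise ValueError("embedded quote is not doubled")
    return inner.replace(b"''", b"'")


raw = b"it's \xff raw"
print(quote(raw))
print(unquote(quote(raw)) == raw)
```

Since only one octet value (0x27) is special, any other byte, valid UTF-8 or not, passes through unchanged.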
--
        Thierry Laronde <tlaronde +AT+ polynum +dot+ com>
                      http://www.kergis.com/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89  250D 52B1 AE95 6006 F40C
