9fans - fans of the OS Plan 9 from Bell Labs
* [9fans] Octets regexp
@ 2013-05-02 12:38 tlaronde
  2013-05-02 12:48 ` erik quanstrom
  2013-05-02 16:16 ` tlaronde
  0 siblings, 2 replies; 29+ messages in thread
From: tlaronde @ 2013-05-02 12:38 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

Regexp(6) handles "characters" that are runes.

I wonder whether the Plan 9 developers, when designing an approach to
localization, ever considered byte (octet) regexps: that is, regexps
operating on octet strings rather than runes (perhaps UTF-8 taken as
is), so that they could also be used on binary data, not only on
newline-terminated chunks.

--
        Thierry Laronde <tlaronde +AT+ polynum +dot+ com>
                      http://www.kergis.com/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89  250D 52B1 AE95 6006 F40C




* Re: [9fans] Octets regexp
  2013-05-02 12:38 [9fans] Octets regexp tlaronde
@ 2013-05-02 12:48 ` erik quanstrom
  2013-05-02 13:25   ` tlaronde
  2013-05-02 16:16 ` tlaronde
  1 sibling, 1 reply; 29+ messages in thread
From: erik quanstrom @ 2013-05-02 12:48 UTC (permalink / raw)
  To: 9fans

> Regexp(6) handles "characters" that are runes.

perhaps the man page is misleading.  rune in this context means utf-8.
see regexp(2).  all the functions take char*s.

> I wonder if Plan9 developers, when trying to design a way towards some
> localization, have ever thought of bytes (octets) regexp, that is using
> regexp with not rune but octets strings (maybe UTF-8 as is) allowing to
> use regexp with binary too, not only newline terminated chunks etc.?

one of the points of plan 9 was to standardize on one character set,
utf-8.  imho, localization and character set aren't related unless one
is dealing with 8859-x overlays or some other character set insufficient
to represent the range of languages.

however, sam and acme allow for structured regular expressions,
and are generally not line oriented:

http://doc.cat-v.org/bell_labs/structural_regexps/se.pdf

and iirc, cinap has written a cifs bit that uses a bit of binary matching.

- erik




* Re: [9fans] Octets regexp
  2013-05-02 12:48 ` erik quanstrom
@ 2013-05-02 13:25   ` tlaronde
  2013-05-02 13:43     ` Tristan
                       ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: tlaronde @ 2013-05-02 13:25 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Thu, May 02, 2013 at 08:48:06AM -0400, erik quanstrom wrote:
> > Regexp(6) handles "characters" that are runes.
>
> perhaps the man page is misleading.  rune in this context means utf-8.
> see regexp(2).  all the functions take char*s.

But the source files deal with runes...

>
> one of the points of plan 9 was to standardize on one character set,
> utf-8.  imho, localization and character set aren't related unless one
> is dealing with 8859-x overlays or some other character set insufficient
> to represent the range of languages.
>

Localization (as "handled" in POSIX, for example) is a mess. The Plan 9
solution, which still uses octets (UTF-8), makes far more sense: it
extends, for the user, the "characters" that can be used to name
computer objects, but these are just nicknames; the system still
speaks C/9P.

So it is better, except perhaps for one thing. For me, the system
"speaks" C or, obviously, "Plan 9" (well: 9P). It does not have to
speak French, Hebrew, etc., or even English! It takes and gives bytes,
and this is good. UTF-8 is the main convention for the user interface,
but can it be unset? I mean, can one use a "raw" window that accepts
uninterpreted bytes and renders bytes (with a special "ASCII" font
containing ASCII plus "0xdd" glyphs or whatever, using fonts to do what
is done with vis(1) on Unices or with od(1)/xd(1)), without imposing
the assumption that the octet string is UTF-8? Can one create a file by
entering bytes, i.e. binary values that yield incorrect UTF-8
sequences?

This is a remark made to me by a developer who, when needed, uses
regexps (ed(1) or sed(1)) on a Unix where they still deal with "char"
(bytes) to search for a string of bytes in a binary.

And after some thought, I don't see an obvious reason why regexps
could not work on byte strings (so UTF-8 is OK) without trying to
match runes (since not every byte string is a correct UTF-8 sequence).

Corollary: I don't know whether there is a UTF-8 sequence that says
"stop interpreting as UTF-8, take what follows as is". (Apart from
every incorrect sequence, the problem is how to come back: if
everything is OK "as is", what can be interpreted as "stop raw,
restart UTF-8"? Solution: this belongs at user level, not low level,
with the shell explicitly delimiting chunks, just as "'" is the only
delimiter and every embedded "'" has to be "escaped" by doubling it.)

* Re: [9fans] Octets regexp
  2013-05-02 13:25   ` tlaronde
@ 2013-05-02 13:43     ` Tristan
  2013-05-02 14:19       ` Tristan
  2013-05-02 14:51       ` tlaronde
  2013-05-02 13:44     ` erik quanstrom
  2013-05-02 14:58     ` a
  2 siblings, 2 replies; 29+ messages in thread
From: Tristan @ 2013-05-02 13:43 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

> And after some thought, I don't see an obvious reason why the regexp
> could not be used with bytes strings (so UTF-8 is OK) without trying to
> match runes (since not every bytes string is a correct UTF-8 sequence).

with octet based regexps, [Þþ] doesn't match þ, but 0xc3, 0xbe and 0x9e
independently.
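
A minimal sketch of that difference, using Python's re module as a
stand-in for a byte-oriented matcher (an assumption made for
illustration; this is not Plan 9's regexp library). The same class is
compiled once as text and once as its raw UTF-8 bytes:

```python
import re

# The class from the message, as text and as its raw UTF-8 encoding.
rune_class = "[Þþ]"
byte_class = rune_class.encode("utf-8")   # b'[\xc3\x9e\xc3\xbe]'

subject = "þ"
subject_bytes = subject.encode("utf-8")   # b'\xc3\xbe'

# Rune-based matching: the class has two members, and þ is one of them.
rune_hits = re.findall(rune_class, subject)

# Byte-based matching: the class now has three members (0xc3, 0x9e, 0xbe),
# so the two bytes of þ are matched one at a time.
byte_hits = re.findall(byte_class, subject_bytes)

print(rune_hits)
print(byte_hits)
```

With the byte class, a lone 0x9e or 0xc3 in binary data would also
count as a match, which is rarely what the author of the pattern meant.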

tristan

-- 
All original matter is hereby placed immediately under the public domain.




* Re: [9fans] Octets regexp
  2013-05-02 13:25   ` tlaronde
  2013-05-02 13:43     ` Tristan
@ 2013-05-02 13:44     ` erik quanstrom
  2013-05-02 14:43       ` tlaronde
  2013-05-02 14:58     ` a
  2 siblings, 1 reply; 29+ messages in thread
From: erik quanstrom @ 2013-05-02 13:44 UTC (permalink / raw)
  To: 9fans

> This is a reflexion made to me by a developer who can use, when
> needed, regexp (ed(1) or sed(1)) on an Unix where they still deal
> with "char" (bytes) to search for a string of bytes in a binary.

i have never needed to do this.  could you provide some motivation
for grepping for a weird byte in an executable?  surely the debugger
is better suited for this.

> And after some thought, I don't see an obvious reason why the regexp
> could not be used with bytes strings (so UTF-8 is OK) without trying to
> match runes (since not every bytes string is a correct UTF-8 sequence).

because it makes things more complicated and probably worse for the
common case, while not providing any functionality that isn't already
in other tools.

> Corollary: I don't know if there is an UTF-8 sequence that can tell:
> stop interpreting as UTF-8, takes "as is" (except every incorrect
> sequence, problem being to come back from there: if everything is OK "as
> is", what can be interpreted as: "stops raw, restart
> UTF-8"---solution: this is on user level, not low level, and this is in
> the shell explicitely delimiting chunks, like "'" is the only delimiter,
> and every embedded "'" has to be "escaped" by doubling it).

i think you've missed the point of making utf-8 *the* character set.
it's not sometimes the character set.  or only on tuesday.  it's always
the character set.

- erik




* Re: [9fans] Octets regexp
  2013-05-02 13:43     ` Tristan
@ 2013-05-02 14:19       ` Tristan
  2013-05-02 14:51       ` tlaronde
  1 sibling, 0 replies; 29+ messages in thread
From: Tristan @ 2013-05-02 14:19 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

putting a little more thought into your actual problem, use tcs:

tcs -f 8859-1

which (as i remember) will map 0x80-ff to U0080-00ff and you can use
normal utf8 regular expressions.
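
A sketch of the same idea in Python rather than with tcs (Python is an
assumption here, chosen only because its latin-1 codec performs the
byte-to-code-point mapping described above). Every byte 0x00-0xff
becomes the code point with the same number, so arbitrary binary data
turns into a valid string and ordinary text regexps apply:

```python
import re

# Arbitrary bytes, including sequences that are not valid UTF-8.
blob = b"\x00\xff\xfeHDR\x80\x81\x82\xc3"

# latin-1 maps each byte N to code point U+00NN, so decoding never fails
# and character offsets equal byte offsets.
as_text = blob.decode("latin-1")

# A text regexp over the decoded data: "HDR" followed by a run of
# high-bit bytes.
m = re.search("HDR([\x80-\xff]+)", as_text)

# Map the match back to the original bytes.
payload = m.group(1).encode("latin-1")
offset = m.start(1)

print(offset, payload)
```

The price is the conversion itself: the data has to be decoded before
matching and encoded again afterwards.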

tristan


* Re: [9fans] Octets regexp
  2013-05-02 13:44     ` erik quanstrom
@ 2013-05-02 14:43       ` tlaronde
  0 siblings, 0 replies; 29+ messages in thread
From: tlaronde @ 2013-05-02 14:43 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Thu, May 02, 2013 at 09:44:38AM -0400, erik quanstrom wrote:
> > This is a reflexion made to me by a developer who can use, when
> > needed, regexp (ed(1) or sed(1)) on an Unix where they still deal
> > with "char" (bytes) to search for a string of bytes in a binary.
>
> i have never needed to do this.  could you provide some motivation
> for grepping for a weird byte in an executable?  surely the debugger
> is better suited for this.
>

Because not everything is a program? Some of it is data. For example,
the TeX (or METAFONT etc.) predigested dumps are binary, but they are
not programs.

> > And after some thought, I don't see an obvious reason why the regexp
> > could not be used with bytes strings (so UTF-8 is OK) without trying to
> > match runes (since not every bytes string is a correct UTF-8 sequence).
>
> because it makes things more complicated and probably worse for the
> common case, while not providing any functionality that isn't already
> in other tools.
>

Ah? I thought the purpose was to avoid duplicated tools... And I'm
not sure it would be more complicated for the common cases, since the
already defined functions could be wrappers calling lower-level
functions that are told the size of the "entity": byte, wyde, tetra,
octa (and, while I'm at it, endianness too) or UTF-8.

>
> i think you've missed the point of making utf-8 *the* character set.
> it's not sometimes the character set.  or only on tuesday.  it's always
> the character set.
>
No: I have understood this. What I'm not totally sure about is this:
the system deals with octet strings (as it does), and UTF-8, i.e.
Unicode, lives at the user interface; but is there a means of not
having the interface interpret the strings as UTF-8? Because not
everything is text.


* Re: [9fans] Octets regexp
  2013-05-02 13:43     ` Tristan
  2013-05-02 14:19       ` Tristan
@ 2013-05-02 14:51       ` tlaronde
  2013-05-02 15:02         ` Bence Fábián
  2013-05-02 15:10         ` Kurt H Maier
  1 sibling, 2 replies; 29+ messages in thread
From: tlaronde @ 2013-05-02 14:51 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Thu, May 02, 2013 at 09:43:10AM -0400, Tristan wrote:
> > And after some thought, I don't see an obvious reason why the regexp
> > could not be used with bytes strings (so UTF-8 is OK) without trying to
> > match runes (since not every bytes string is a correct UTF-8 sequence).
> 
> with octet based regexps, [Þþ] doesn't match þ, but 0xc3, 0xbe and 0x9e
> independently.
> 

Regexps have subexpressions, so this could be achieved. The present
functions could even become higher-level ones that call more basic
ones dealing with bytes (a rune specified by a UTF-8 sequence being
replaced by a subexpression), or even dealing with elements of various
sizes (characters, but of one fixed size for the processing).
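
A rough sketch of the "rune replaced by a subexpression" idea, written
in Python purely for illustration (rune_class_to_bytes is a
hypothetical helper, and Python's re stands in for a byte-level
matcher; none of this is libregexp code). Each rune of a class is
expanded into its UTF-8 byte sequence, and the sequences are joined as
alternatives:

```python
import re

def rune_class_to_bytes(runes):
    """Build a byte-level subexpression that matches any one of `runes`.

    Hypothetical helper: each rune becomes its escaped UTF-8 byte
    sequence, and the sequences are joined as alternatives.
    """
    alternatives = [re.escape(r.encode("utf-8")) for r in runes]
    return b"(?:" + b"|".join(alternatives) + b")"

pattern = re.compile(rune_class_to_bytes("Þþ"))

# Binary data: þ, a stray lead byte 0xc3, a NUL, then Þ.
data = b"\xc3\xbe\xc3\x00\xc3\x9e"
hits = [(m.start(), m.group()) for m in pattern.finditer(data)]
print(hits)
```

Unlike a plain byte class, the alternation only matches complete
sequences, so the stray 0xc3 in the middle of the data is skipped.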

Or even a specification à la C: a leading 'L' would mean "treat the
string as UTF-8, that is, match runes". Without it, leave the string
alone.


* Re: [9fans] Octets regexp
  2013-05-02 13:25   ` tlaronde
  2013-05-02 13:43     ` Tristan
  2013-05-02 13:44     ` erik quanstrom
@ 2013-05-02 14:58     ` a
  2013-05-02 15:08       ` tlaronde
  2 siblings, 1 reply; 29+ messages in thread
From: a @ 2013-05-02 14:58 UTC (permalink / raw)
  To: 9fans

your exact problem still isn't clear to me, but certainly there've
been times when I want to search for some array of characters
in a binary blob. i don't believe i've needed anything beyond a
literal string of bytes, but i could imagine from there the utility of
something regexp-like.

i think the answer is just "no, there's no way to do that today".
and i'd strongly advise keeping that tool as far away from any
discussion of localization or character sets or runes or the like.
there oughtn't be any mode switching or the like: it's utf-8
encoded unicode runes, or it's binary, not characters at all.

hex editors are useful sometimes. being able to do more
complicated searches/edits/replaces/whatever could be
similarly useful sometimes. but don't go anywhere near the
character set or localization discussions with it.

anthony





* Re: [9fans] Octets regexp
  2013-05-02 14:51       ` tlaronde
@ 2013-05-02 15:02         ` Bence Fábián
  2013-05-02 15:20           ` tlaronde
  2013-05-02 15:10         ` Kurt H Maier
  1 sibling, 1 reply; 29+ messages in thread
From: Bence Fábián @ 2013-05-02 15:02 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs


you want to change the default behaviour and make the usual use case special?


2013/5/2 <tlaronde@polynum.com>

>
> Or even a specification à la C: by adding a leading 'L' meaning:
> treat the string as UTF-8 that is masters runes. And if not, leave
> it alone.
>
>


* Re: [9fans] Octets regexp
  2013-05-02 14:58     ` a
@ 2013-05-02 15:08       ` tlaronde
  2013-05-02 15:19         ` erik quanstrom
  0 siblings, 1 reply; 29+ messages in thread
From: tlaronde @ 2013-05-02 15:08 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Thu, May 02, 2013 at 10:58:30AM -0400, a@9srv.net wrote:
>
> i think the answer is just "no, there's no way to do that today".
> and i'd strongly advise keeping that tool as far away from any
> discussion of localization or character sets or runes or the like.
> there oughtn't be any mode switching or the like: it's utf-8
> encoded unicode runes, or it's binary, not characters at all.
>

But that is exactly my point: to keep localization far from regexp,
with regexp simply taking a string of bytes and matching strings of
bytes. (For me, the main advantage of UTF-8 is not Unicode (UTF-8 could
survive as an encoding for something other than Unicode), but
precisely that it is still strings of octets, so that the system
can be left alone, far from localization.)

A side effect is that tools unaware of Unicode (UTF-8) can be used
with any string of bytes, since no interpretation is done.

One could even imagine using regexps to find a pattern in an image (even
a sed-like program that tries to match a first-row pattern and then
checks whether the following rows match some patterns too).


* Re: [9fans] Octets regexp
  2013-05-02 14:51       ` tlaronde
  2013-05-02 15:02         ` Bence Fábián
@ 2013-05-02 15:10         ` Kurt H Maier
  2013-05-02 15:21           ` tlaronde
  1 sibling, 1 reply; 29+ messages in thread
From: Kurt H Maier @ 2013-05-02 15:10 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

Why does this functionality have to be overloaded into existing tools
that are already in common use?

khm




* Re: [9fans] Octets regexp
  2013-05-02 15:08       ` tlaronde
@ 2013-05-02 15:19         ` erik quanstrom
  2013-05-02 15:31           ` tlaronde
  2013-05-02 18:45           ` dexen deVries
  0 siblings, 2 replies; 29+ messages in thread
From: erik quanstrom @ 2013-05-02 15:19 UTC (permalink / raw)
  To: 9fans

> But that is exactly my point: to have localization far from regexp.
> Regexp taking simply a string of bytes and matching strings of bytes.

the plan 9 model is that all text is utf-8, with the exception of
internal encodings which may be Runes.

is your proposal
- to change programs that take regular expressions to be exceptions to
the plan 9 text model, or
- to change the plan 9 text model
?

either way, i think the bar should be high to change the text model
for plan 9, and higher to make exceptions.

- erik




* Re: [9fans] Octets regexp
  2013-05-02 15:02         ` Bence Fábián
@ 2013-05-02 15:20           ` tlaronde
  2013-05-02 15:27             ` erik quanstrom
  0 siblings, 1 reply; 29+ messages in thread
From: tlaronde @ 2013-05-02 15:20 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Thu, May 02, 2013 at 05:02:45PM +0200, Bence Fábián wrote:
> you want to change default behaviour and make the usual usecase special?
> 

For the moment, I don't want to change anything. I'm trying to work
out where the border has to be: "characters" (for me, user level) on
one side, octet strings on the other, system and library, side. (On a
distributed system, it makes sense that filenames, being user-level
nicknames, are UTF-8, or rather are supposed to be UTF-8, without any
per-filename codepage or whatever.)

The usual behavior could perfectly well stay the same (the leading
"L" was just an example; it could be reversed, and octet matching
could simply be provided by new functions with new names, the
historical ones calling these new character-agnostic ones). The
problem is not there. The problem is: are regexps only useful with
"text", implying "characters", or are they more widely useful? My
feeling is that they are more generally useful.


* Re: [9fans] Octets regexp
  2013-05-02 15:10         ` Kurt H Maier
@ 2013-05-02 15:21           ` tlaronde
  0 siblings, 0 replies; 29+ messages in thread
From: tlaronde @ 2013-05-02 15:21 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Thu, May 02, 2013 at 11:10:34AM -0400, Kurt H Maier wrote:
> Why does this functionality have to be overloaded into existing tools
> that are already in common use?
>

I'm speaking about libregexp, not about the use that existing tools
make of it.

* Re: [9fans] Octets regexp
  2013-05-02 15:20           ` tlaronde
@ 2013-05-02 15:27             ` erik quanstrom
  0 siblings, 0 replies; 29+ messages in thread
From: erik quanstrom @ 2013-05-02 15:27 UTC (permalink / raw)
  To: 9fans

> For the moment, I don't want to change anything, I'm trying to be
> convinced where the border has to be: "characters" (for me user
> level) on the one side, octets strings on the other system and
> library side (on a distributed system, it makes sense that filenames,
> being userlevel nicknames be UTF-8---supposed to be UTF-8 without any
> per filename codepage or whatever).

there is currently no such distinction between user and library.
this eliminates context.  one never is confronted with, "oh, i can't
call that because that's a user function, not a library function".

- erik




* Re: [9fans] Octets regexp
  2013-05-02 15:19         ` erik quanstrom
@ 2013-05-02 15:31           ` tlaronde
  2013-05-02 16:53             ` erik quanstrom
  2013-05-02 18:45           ` dexen deVries
  1 sibling, 1 reply; 29+ messages in thread
From: tlaronde @ 2013-05-02 15:31 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Thu, May 02, 2013 at 11:19:38AM -0400, erik quanstrom wrote:
>
> the plan 9 model is that all text is utf-8, with the exception of
> internal encodings which may be Runes.
>
> is your proposal
> - to change programs that take regular expressions to be exceptions to
> the plan 9 text model, or

No: to have a libregexp that is agnostic about encoding. The tools can
stay the same for the user; libregexp would simply be octet based
rather than "text" based.

> - to change the plan 9 text model

Neither. The "text" model is a user interface. My question is simply
how difficult it is to have an alternative, special-purpose user
interface that does not apply the UTF-8 filter to input from and
output to the user.


* Re: [9fans] Octets regexp
  2013-05-02 12:38 [9fans] Octets regexp tlaronde
  2013-05-02 12:48 ` erik quanstrom
@ 2013-05-02 16:16 ` tlaronde
  1 sibling, 0 replies; 29+ messages in thread
From: tlaronde @ 2013-05-02 16:16 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Thu, May 02, 2013 at 02:38:25PM +0200, tlaronde@polynum.com wrote:
> Regexp(6) handles "characters" that are runes.
>

Answering myself: regexp deals with entities called "characters".
Some regexp constructs ('.', ranges, classes etc.) apply to
"characters".

This means that the size of a character has to be known. One cannot
process UTF-8 directly while ignoring that it is UTF-8, since '.',
for example, matches a variable-size sequence whose start depends on
what came before.

So a libregexp dealing with more than runes would be possible, but one
would need to specify the fixed size of the characters, i.e. the
"encoding" of the input (this has nothing to do with localization, only
with what the elementary entity is).
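
A small illustration of why the element size matters, again using
Python's re module as a stand-in (an assumption; this is not Plan 9's
regexp library). The same '.' consumes one rune of varying byte length
in text mode, but exactly one octet in byte mode:

```python
import re

text = "aé€"                  # runes of 1, 2 and 3 UTF-8 bytes
data = text.encode("utf-8")   # 6 bytes in total

# Text mode: '.' is one rune, whatever its encoded length.
rune_units = re.findall(".", text)
rune_sizes = [len(r.encode("utf-8")) for r in rune_units]

# Byte mode: '.' is one octet (DOTALL so that 0x0a counts too).
byte_units = re.findall(b".", data, re.DOTALL)

print(len(rune_units), rune_sizes)
print(len(byte_units))
```

The pattern is identical in both calls; only the declared element type
of the input changes what '.' means.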


* Re: [9fans] Octets regexp
  2013-05-02 15:31           ` tlaronde
@ 2013-05-02 16:53             ` erik quanstrom
  2013-05-02 18:59               ` tlaronde
  0 siblings, 1 reply; 29+ messages in thread
From: erik quanstrom @ 2013-05-02 16:53 UTC (permalink / raw)
  To: 9fans

> > is your proposal
> > - to change programs that take regular expressions to be exceptions to
> > the plan 9 text model, or
>
> No: to have a libregexp being agnostic about any encoding. The tools can
> stay, for user, the same, simply libregexp would not be "text" based but
> octets based.

there's always an encoding.

> > - to change the plan 9 text model
>
> Neither. The "text" model is a user interface. My question is simply how
> is it difficult to have an alternative, special purpose, user interface,
> that do not have the UTF-8 filter for input from and output to the user
> interface.

i see we're at an impasse, since i don't agree that utf-8 is a user
interface thing.  it's more entrenched than that.

why don't you code something up?

- erik




* Re: [9fans] Octets regexp
  2013-05-02 15:19         ` erik quanstrom
  2013-05-02 15:31           ` tlaronde
@ 2013-05-02 18:45           ` dexen deVries
  2013-05-02 19:04             ` tlaronde
  1 sibling, 1 reply; 29+ messages in thread
From: dexen deVries @ 2013-05-02 18:45 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

please pardon the silly question, but... how about piping the binary data 
through xd(1) before sending it to regexp(3)?
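
A hedged sketch of what such a pipeline could look like, with Python's
bytes.hex() standing in for xd(1) (an assumption: unlike xd, it emits
no addresses, spaces or newlines). The helper find_in_hex is
hypothetical; it forces matches to start on a byte boundary:

```python
import re

def find_in_hex(data, hex_pattern):
    """Return the byte offset of the first aligned match, or -1.

    `hex_pattern` is a regexp over lowercase hex digits, two per byte.
    The lazy prefix consumes whole bytes only, so a match can never
    start in the middle of a byte.
    """
    text = data.hex()   # no addresses, spaces or newlines
    m = re.match("(?:[0-9a-f]{2})*?(" + hex_pattern + ")", text)
    return m.start(1) // 2 if m else -1

# The digits "dead" first appear straddling bytes 0x0d 0xea 0xdb,
# and only later as the real byte pair 0xde 0xad.
data = bytes.fromhex("0deadbeef0dead00beef")

aligned = find_in_hex(data, "dead")
with_gap = find_in_hex(data, "dead(?:..)*?beef")
missing = find_in_hex(data, "ffff")
print(aligned, with_gap, missing)
```

The pattern still has to be written in terms of hex digits, so a
"byte range" becomes a range over pairs of digits.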

-- 
dexen deVries

[[[↓][→]]]

I have seen the Great Pretender and he is not what he seems.





* Re: [9fans] Octets regexp
  2013-05-02 16:53             ` erik quanstrom
@ 2013-05-02 18:59               ` tlaronde
  0 siblings, 0 replies; 29+ messages in thread
From: tlaronde @ 2013-05-02 18:59 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Thu, May 02, 2013 at 12:53:19PM -0400, erik quanstrom wrote:
>
> i see we're at an impasse, since i don't agree that utf-8 is a user
> interface thing.  it's more entrenched than that.
>
> why don't you code something up?

Because I have started sketching (this was for kerTeX/RISK) "basys", i.e.
basic system tools, but I'm still deciding whether to start mainly from
BSD tools (ash, libregex, sed, ed and the small set of utilities
used by RISK or by the kerTeX package framework) or from Plan 9 ones
(rc has some features that make it worthwhile). But I want "basys" to
be a "C language" system: the system speaks Cee, and that's all. A
non-integer number is written with a '.' and not a ',', even for the
French, and so on. (This is an example of POSIX hell: the *printf() and
*scanf() functions use the locale to decide how to interpret or render
numbers, even when they are used to read files rather than to interact
with the user, so any value in the user environment spoils things if
the code has not protected against it.) The system deals with octet
strings (for the user's language, let them be UTF-8, but the system
strictly doesn't care: these are octet strings), and for libregex(p)
the rune thing does not appeal to me (correction: the rune-only thing,
even if it does make sense as a definition of "character").

I might as well end up with a modified sh or rc that deals with C
strings (with an L, for "hell"?, for UTF-8, nothing for octets, W for
wydes, T for tetras and O for octas, and even a modifier for
endianness).

But contrary to what is "state of the art", I take a long time to study
and make things clear (to myself... YMMV), and after that I press on
with implementing in the direction I have chosen (it may take
"calendar" time, but simply because of limited slots of time; during
these slots I don't wonder about what has to be done: it is already
decided...). Until I have made the choice...

I have already decided that I will implement a bar(1) that only packs
the data, with a "volume" listing in text whatever attributes, in
attribute=value form, are linked to the data (this is, in some sense,
what RISK already does with rkinstall(1), except that it uses tar(1)
to pack data). That is, bar(1) will be a pure C89 program without any
system-dependent part (this will allow doing whatever one wants with
the data, for example changing names to fit local conventions (the man
hierarchy), compressing man pages, caching the rendering, adding
extensions etc.).

* Re: [9fans] Octets regexp
  2013-05-02 18:45           ` dexen deVries
@ 2013-05-02 19:04             ` tlaronde
  2013-05-02 19:22               ` erik quanstrom
  0 siblings, 1 reply; 29+ messages in thread
From: tlaronde @ 2013-05-02 19:04 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Thu, May 02, 2013 at 08:45:28PM +0200, dexen deVries wrote:
> please pardon the silly question, but... how about piping the binary data
> through xd(1) before sending it to regexp(3)?
>

Because it will work only in some cases: newlines and formatting come
into the picture, and it still requires the regexp to be rune
compatible, i.e. not every sequence is allowed; it has to be a UTF-8
compatible one.


* Re: [9fans] Octets regexp
  2013-05-02 19:04             ` tlaronde
@ 2013-05-02 19:22               ` erik quanstrom
  2013-05-02 19:39                 ` tlaronde
  0 siblings, 1 reply; 29+ messages in thread
From: erik quanstrom @ 2013-05-02 19:22 UTC (permalink / raw)
  To: 9fans

> On Thu, May 02, 2013 at 08:45:28PM +0200, dexen deVries wrote:
> > please pardon the silly question, but... how about piping the binary data
> > through xd(1) before sending it to regexp(3)?
> >
>
> Because it will work only for some cases, since newlines and formatting
> come in the picture and it still imposes to have regexp rune compatible,
> i.e. not every sequence is allowed it has to be an UTF-8 compatible one.

can you give an example of xd outputting something that's not a rune?

- erik




* Re: [9fans] Octets regexp
  2013-05-02 19:22               ` erik quanstrom
@ 2013-05-02 19:39                 ` tlaronde
  2013-05-02 20:13                   ` erik quanstrom
  2013-05-02 20:17                   ` 9p-st
  0 siblings, 2 replies; 29+ messages in thread
From: tlaronde @ 2013-05-02 19:39 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Thu, May 02, 2013 at 03:22:21PM -0400, erik quanstrom wrote:
>
> can you give an example of xd outputting something that's not a rune?
>

Indeed, if the regexp is an ASCII representation matching xd output,
there is not _this_ problem. But this is a limited regexp: one cannot
use "character" ranges (it depends on the size) or '.'; the
conversion has to be done; and there is still the newline
problem (the newlines are added; they are not in the original data).
(If functions have been added to avoid dealing with the newline, it is
because the newline is a problem, and because regexps have wider uses
than "text".)


* Re: [9fans] Octets regexp
  2013-05-02 19:39                 ` tlaronde
@ 2013-05-02 20:13                   ` erik quanstrom
  2013-05-02 20:17                   ` 9p-st
  1 sibling, 0 replies; 29+ messages in thread
From: erik quanstrom @ 2013-05-02 20:13 UTC (permalink / raw)
  To: 9fans

> Indeed, if the regexp is an ASCII representation matching xd outputs
> there is not _this_ problem. But this is limited regexp, since one can
> not use "character" ranges (it depends on the size); not '.'; because

now you're at both ends.  the whole reason for this approach is to
match bytes that aren't valid runes.  so why complain that it does
what you want?

- erik




* Re: [9fans] Octets regexp
  2013-05-02 19:39                 ` tlaronde
  2013-05-02 20:13                   ` erik quanstrom
@ 2013-05-02 20:17                   ` 9p-st
  2013-05-03 11:16                     ` tlaronde
  1 sibling, 1 reply; 29+ messages in thread
From: 9p-st @ 2013-05-02 20:17 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

> On Thu, May 02, 2013 at 03:22:21PM -0400, erik quanstrom wrote:
> > can you give an example of xd outputting something that's not a rune?

if we're talking about xd, i'll suggest 'tcs -f 8859-1' again in which case:

> Indeed, if the regexp is an ASCII representation matching xd outputs
> there is not _this_ problem. But this is limited regexp, since one can
> not use "character" ranges (it depends on the size); not '.';

these problems go away

> because the conversion has to be done;

this remains

> because there is still the newline problem (that is added; not something in
> the original data) (if functions have been added to not deal with the
> newline, it is because the newline is a problem, and because regexps have a
> wider use than "text").

and this problem goes away.

i imagine you'll still have problems with embedded NULs, but that's C
strings for you...

if you want a library function, use rregexec and rregsub from regexp(2)
with only the low byte of each Rune filled...

(and yes, your data does quadruple itself)
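
A rough illustration of the widening idea, written in Python rather than against the Plan 9 regexp library: each octet becomes one code point with the same value, which is the analogue of filling only the low byte of each Rune. The helper name widen is made up for this sketch, and the factor of four assumes a 32-bit Rune, modelled here by UTF-32.

```python
import re

def widen(data: bytes) -> str:
    """Hypothetical helper: one code point (0..255) per input octet."""
    return "".join(chr(b) for b in data)

# Binary data with embedded NULs and bytes that are not valid UTF-8.
data = b"\x00\xff\xfeABC\x00\x80"
wide = widen(data)

# A "character" range now really is a range of octet values.
m = re.search("[\x80-\xff]+", wide)
print(m.span())

# Storage cost when every octet occupies a 32-bit unit.
print(len(data), len(wide.encode("utf-32-le")))
```

Python strings carry their length, so the embedded NULs cause no trouble here; with C strings they would still end the subject early.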

tristan

-- 
All original matter is hereby placed immediately under the public domain.




* Re: [9fans] Octets regexp
  2013-05-02 20:17                   ` 9p-st
@ 2013-05-03 11:16                     ` tlaronde
  2013-05-03 13:15                       ` Tristan
  0 siblings, 1 reply; 29+ messages in thread
From: tlaronde @ 2013-05-03 11:16 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Thu, May 02, 2013 at 04:17:11PM -0400, 9p-st@imu.li wrote:
>
> if we're talking about xd, i'll suggest 'tcs -f 8859-1' again in which case:
>

My question was _not_ related to text, and _not_ related to "French" i.e.
8859-1. I know how to deal with this.

--
        Thierry Laronde <tlaronde +AT+ polynum +dot+ com>
                      http://www.kergis.com/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89  250D 52B1 AE95 6006 F40C




* Re: [9fans] Octets regexp
  2013-05-03 11:16                     ` tlaronde
@ 2013-05-03 13:15                       ` Tristan
  2013-05-03 16:33                         ` tlaronde
  0 siblings, 1 reply; 29+ messages in thread
From: Tristan @ 2013-05-03 13:15 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

> > if we're talking about xd, i'll suggest 'tcs -f 8859-1' again in which case:

> My question was _not_ related to text, and _not_ related to "French" i.e.
> 8859-1. I know how to deal with this.

tcs -f 8859-1

will take your _binary_ files, and replace the bytes 0x80-0xff with the
unicode code points U+0080-U+00FF, so you can use the standard regexps and tools
on them. and just convert back afterwards.

maybe it's not meant to be used that way, but it _works_. try it.
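
the same round trip can be sketched in Python, with its latin-1 codec standing in for tcs (an assumption of this sketch, not the Plan 9 tool itself). decoding maps byte N to code point U+00NN, ordinary regexps then apply, and encoding maps the result back to bytes:

```python
import re

# Every possible octet, including NUL and newline.
raw = bytes(range(256))

# Byte N becomes code point U+00NN; nothing is rejected as invalid.
text = raw.decode("latin-1")

# The mapping is reversible, so the original bytes come back unchanged.
assert text.encode("latin-1") == raw

# Use an ordinary regexp with a range over the high "bytes",
# then convert the edited text back to octets.
edited = re.sub("[\u0080-\u00ff]+", "?", text)
back = edited.encode("latin-1")
print(len(raw), len(back))
```

the conversion step remains, as noted earlier in the thread, and the text form takes more room than the bytes did.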

have fun!
tristan

-- 
All original matter is hereby placed immediately under the public domain.




* Re: [9fans] Octets regexp
  2013-05-03 13:15                       ` Tristan
@ 2013-05-03 16:33                         ` tlaronde
  0 siblings, 0 replies; 29+ messages in thread
From: tlaronde @ 2013-05-03 16:33 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Fri, May 03, 2013 at 09:15:27AM -0400, Tristan wrote:
>
> tcs -f 8859-1
>
> will take your _binary_ files, and replace the bytes 0x80-0xff with the
> unicode code points U+0080-U+00FF, so you can use the standard regexps and tools
> on them. and just convert back afterwards.
>

OK, mea culpa... Since I'm French, I focused on Latin-1, thinking
this had something to do with my language and the custom of dealing
with Latin-1 on other systems.

I guess I could create a keyboard that produces not UTF-8 but bytes,
so as to have a means of inputting bytes (without resorting to printf
or whatever). The rendering problem remains (or I could create a
special font that displays octal, hexadecimal or whatever by playing
with the glyph indexes; but this will work only for octets, will be
more difficult if one wants to deal with wydes, and will be impossible
with tetras and octas).

--
        Thierry Laronde <tlaronde +AT+ polynum +dot+ com>
                      http://www.kergis.com/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89  250D 52B1 AE95 6006 F40C




end of thread, other threads:[~2013-05-03 16:33 UTC | newest]

Thread overview: 29+ messages
2013-05-02 12:38 [9fans] Octets regexp tlaronde
2013-05-02 12:48 ` erik quanstrom
2013-05-02 13:25   ` tlaronde
2013-05-02 13:43     ` Tristan
2013-05-02 14:19       ` Tristan
2013-05-02 14:51       ` tlaronde
2013-05-02 15:02         ` Bence Fábián
2013-05-02 15:20           ` tlaronde
2013-05-02 15:27             ` erik quanstrom
2013-05-02 15:10         ` Kurt H Maier
2013-05-02 15:21           ` tlaronde
2013-05-02 13:44     ` erik quanstrom
2013-05-02 14:43       ` tlaronde
2013-05-02 14:58     ` a
2013-05-02 15:08       ` tlaronde
2013-05-02 15:19         ` erik quanstrom
2013-05-02 15:31           ` tlaronde
2013-05-02 16:53             ` erik quanstrom
2013-05-02 18:59               ` tlaronde
2013-05-02 18:45           ` dexen deVries
2013-05-02 19:04             ` tlaronde
2013-05-02 19:22               ` erik quanstrom
2013-05-02 19:39                 ` tlaronde
2013-05-02 20:13                   ` erik quanstrom
2013-05-02 20:17                   ` 9p-st
2013-05-03 11:16                     ` tlaronde
2013-05-03 13:15                       ` Tristan
2013-05-03 16:33                         ` tlaronde
2013-05-02 16:16 ` tlaronde
