* [COFF] Requesting thoughts on extended regular expressions in grep.
@ 2023-03-02 18:54 Grant Taylor via COFF
2023-03-02 19:23 ` [COFF] " Clem Cole
` (4 more replies)
0 siblings, 5 replies; 46+ messages in thread
From: Grant Taylor via COFF @ 2023-03-02 18:54 UTC (permalink / raw)
To: COFF
[-- Attachment #1: Type: text/plain, Size: 1436 bytes --]
Hi,
I'd like some thoughts ~> input on extended regular expressions used
with grep, specifically GNU grep -e / egrep.
What are the pros / cons to creating extended regular expressions like
the following:
^\w{3}
vs:
^(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)
Or:
[ :[:digit:]]{11}
vs:
( 1| 2| 3| 4| 5| 6| 7| 8|
9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31)
(0|1|2)[[:digit:]]:(0|1|2|3|4|5)[[:digit:]]:(0|1|2|3|4|5)[[:digit:]]
I'm currently eliding the 61st (60) second, the 32nd day, and dealing
with February having fewer days for simplicity.
For matching patterns like the following in log files?
Mar 2 03:23:38
I'm working on organically training logcheck to match known good log
entries. So I'm *DEEP* in the bowels of extended regular expressions
(GNU egrep) that runs over all logs hourly. As such, I'm interested in
making sure that my REs are both efficient and accurate or at least not
WILDLY badly structured. The pedantic part of me wants to avoid
wildcard type matches (\w), even if they are bounded (\w{3}), unless it
truly is for unpredictable text.
I'd appreciate any feedback and recommendations from people who have
been using and / or optimizing (extended) regular expressions for longer
than I have been using them.
Thank you for your time and input.
--
Grant. . . .
unix || die
[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4017 bytes --]
^ permalink raw reply [flat|nested] 46+ messages in thread
* [COFF] Re: Requesting thoughts on extended regular expressions in grep.
2023-03-02 18:54 [COFF] Requesting thoughts on extended regular expressions in grep Grant Taylor via COFF
@ 2023-03-02 19:23 ` Clem Cole
2023-03-02 19:38 ` Grant Taylor via COFF
2023-03-02 23:01 ` Stuff Received
2023-03-02 21:53 ` Dan Cross
` (3 subsequent siblings)
4 siblings, 2 replies; 46+ messages in thread
From: Clem Cole @ 2023-03-02 19:23 UTC (permalink / raw)
To: Grant Taylor; +Cc: COFF
[-- Attachment #1: Type: text/plain, Size: 1788 bytes --]
Grant - check out Russ Cox's web page on this very subject: Implementing
Regular Expressions
<https://streaklinks.com/BaglRcA5OeZepwiX_AELMUI5/https%3A%2F%2Fswtch.com%2F%7Ersc%2Fregexp%2F>
ᐧ
On Thu, Mar 2, 2023 at 1:55 PM Grant Taylor via COFF <coff@tuhs.org> wrote:
> Hi,
>
> I'd like some thoughts ~> input on extended regular expressions used
> with grep, specifically GNU grep -e / egrep.
>
> What are the pros / cons to creating extended regular expressions like
> the following:
>
> ^\w{3}
>
> vs:
>
> ^(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)
>
> Or:
>
> [ :[:digit:]]{11}
>
> vs:
>
> ( 1| 2| 3| 4| 5| 6| 7| 8|
> 9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31)
> (0|1|2)[[:digit:]]:(0|1|2|3|4|5)[[:digit:]]:(0|1|2|3|4|5)[[:digit:]]
>
> I'm currently eliding the 61st (60) second, the 32nd day, and dealing
> with February having fewer days for simplicity.
>
> For matching patterns like the following in log files?
>
> Mar 2 03:23:38
>
> I'm working on organically training logcheck to match known good log
> entries. So I'm *DEEP* in the bowels of extended regular expressions
> (GNU egrep) that runs over all logs hourly. As such, I'm interested in
> making sure that my REs are both efficient and accurate or at least not
> WILDLY badly structured. The pedantic part of me wants to avoid
> wildcard type matches (\w), even if they are bounded (\w{3}), unless it
> truly is for unpredictable text.
>
> I'd appreciate any feedback and recommendations from people who have
> been using and / or optimizing (extended) regular expressions for longer
> than I have been using them.
>
> Thank you for your time and input.
>
>
>
> --
> Grant. . . .
> unix || die
>
>
[-- Attachment #2: Type: text/html, Size: 2619 bytes --]
^ permalink raw reply [flat|nested] 46+ messages in thread
* [COFF] Re: Requesting thoughts on extended regular expressions in grep.
2023-03-02 19:23 ` [COFF] " Clem Cole
@ 2023-03-02 19:38 ` Grant Taylor via COFF
2023-03-02 23:01 ` Stuff Received
1 sibling, 0 replies; 46+ messages in thread
From: Grant Taylor via COFF @ 2023-03-02 19:38 UTC (permalink / raw)
To: coff
[-- Attachment #1: Type: text/plain, Size: 308 bytes --]
On 3/2/23 12:23 PM, Clem Cole wrote:
> Grant - check out Russ Cox's web page on this very subject: Implementing
> Regular Expressions
Thank you for the pointer Clem.
It's at the top of my reading list. I'll dig into the articles listed
thereon later today.
--
Grant. . . .
unix || die
[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4017 bytes --]
^ permalink raw reply [flat|nested] 46+ messages in thread
* [COFF] Re: Requesting thoughts on extended regular expressions in grep.
2023-03-02 18:54 [COFF] Requesting thoughts on extended regular expressions in grep Grant Taylor via COFF
2023-03-02 19:23 ` [COFF] " Clem Cole
@ 2023-03-02 21:53 ` Dan Cross
2023-03-03 1:05 ` Grant Taylor via COFF
2023-03-03 10:59 ` Ralph Corderoy
` (2 subsequent siblings)
4 siblings, 1 reply; 46+ messages in thread
From: Dan Cross @ 2023-03-02 21:53 UTC (permalink / raw)
To: Grant Taylor; +Cc: COFF
On Thu, Mar 2, 2023 at 1:55 PM Grant Taylor via COFF <coff@tuhs.org> wrote:
> I'd like some thoughts ~> input on extended regular expressions used
> with grep, specifically GNU grep -e / egrep.
>
> What are the pros / cons to creating extended regular expressions like
> the following:
>
> ^\w{3}
>
> vs:
>
> ^(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)
Well, obviously the former matches any sequence 3 of
alpha-numerics/underscores at the beginning of a string, while the
latter only matches abbreviations of months in the western calendar;
that is, the two REs are matching very different things (the latter is
a strict subset of the former). But I suspect you mean in a more
general sense.
> Or:
>
> [ :[:digit:]]{11}
...do you really want to match a space, a colon and a single digit 11
times in a single string?
> vs:
>
> ( 1| 2| 3| 4| 5| 6| 7| 8|
> 9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31)
> (0|1|2)[[:digit:]]:(0|1|2|3|4|5)[[:digit:]]:(0|1|2|3|4|5)[[:digit:]]
Using character classes would greatly simplify what you're trying to
do. It seems like this could be simplified to (untested) snippet:
( [1-9]|[12][[0-9]]|3[01]) [0-2][0-9]:[0-5][0-9]:[0-5][0-9]
For this, I'd probably eschew `[:digit:]`. Named character classes are
for handy locale support, or in lieu of typing every character in the
alphabet (though we can use ranges to abbreviate that), but it kind of
seems like that's not coming into play here and, IMHO, `[0-9]` is
clearer in context.
> I'm currently eliding the 61st (60) second, the 32nd day, and dealing
> with February having fewer days for simplicity.
It's not clear to me that dates, in their generality, can be matched
with regular expressions. Consider leap years; you'd almost
necessarily have to use backtracking for that, but I admit I haven't
thought it through.
> For matching patterns like the following in log files?
>
> Mar 2 03:23:38
>
> I'm working on organically training logcheck to match known good log
> entries. So I'm *DEEP* in the bowels of extended regular expressions
> (GNU egrep) that runs over all logs hourly. As such, I'm interested in
> making sure that my REs are both efficient and accurate or at least not
> WILDLY badly structured. The pedantic part of me wants to avoid
> wildcard type matches (\w), even if they are bounded (\w{3}), unless it
> truly is for unpredictable text.
`\w` is a GNU extension; I'd probably avoid it on portability grounds
(though `\b` is very handy).
The thing about regular expressions is that they describe regular
languages, and regular languages are those for which there exists a
finite automaton that can recognize the language. An important class
of finite automata are deterministic finite automata; by definition,
recognition by such automata are linear in the length of the input.
However, construction of a DFA for any given regular expression can be
superlinear (in fact, it can be exponential) so practically speaking,
we usually construct non-deterministic finite automata (NDFAs) and
"simulate" their execution for matching. NDFAs generalize DFAs (DFAs
are a subset of NDFAs, incidentally) in that, in any non-terminal
state, there can be multiple subsequent states that the machine can
transition to given an input symbol. When executed, for any state, the
simulator will transition to every permissible subsequent state
simultaneously, discarding impossible states as they become evident.
This implies that NDFA execution is superlinear, but it is bounded,
and is O(n*m*e), where n is the length of the input, m is the number
of nodes in the state transition graph corresponding to the NDFA, and
e is the maximum number of edges leaving any node in that graph (for a
fully connected graph, that would m, so this can be up to O(n*m^2)).
Construction of an NDFA is O(m), so while it's slower to execute, it's
actually possible to construct in a reasonable amount of time. Russ's
excellent series of articles that Clem linked to gives details and
algorithms.
> I'd appreciate any feedback and recommendations from people who have
> been using and / or optimizing (extended) regular expressions for longer
> than I have been using them.
>
> Thank you for your time and input.
In practical terms? Basically, don't worry about it too much. Egrep
will generate an NDFA simulation that's going to be acceptably fast
for all but the weirdest cases.
- Dan C.
^ permalink raw reply [flat|nested] 46+ messages in thread
* [COFF] Re: Requesting thoughts on extended regular expressions in grep.
2023-03-02 19:23 ` [COFF] " Clem Cole
2023-03-02 19:38 ` Grant Taylor via COFF
@ 2023-03-02 23:01 ` Stuff Received
2023-03-02 23:46 ` Steffen Nurpmeso
2023-03-03 1:08 ` Grant Taylor via COFF
1 sibling, 2 replies; 46+ messages in thread
From: Stuff Received @ 2023-03-02 23:01 UTC (permalink / raw)
To: coff
On 2023-03-02 14:23, Clem Cole wrote:
> Grant - check out Russ Cox's web page on this very subject: Implementing
> Regular Expressions
> <https://streaklinks.com/BaglRcA5OeZepwiX_AELMUI5/https%3A%2F%2Fswtch.com%2F%7Ersc%2Fregexp%2F>
Clem, why are you linking through streaklinks.com?
N.
^ permalink raw reply [flat|nested] 46+ messages in thread
* [COFF] Re: Requesting thoughts on extended regular expressions in grep.
2023-03-02 23:01 ` Stuff Received
@ 2023-03-02 23:46 ` Steffen Nurpmeso
2023-03-03 1:08 ` Grant Taylor via COFF
1 sibling, 0 replies; 46+ messages in thread
From: Steffen Nurpmeso @ 2023-03-02 23:46 UTC (permalink / raw)
To: Stuff Received; +Cc: coff
Stuff Received wrote in
<c4328f0f-f304-e131-a9f3-32ce01fa6814@riddermarkfarm.ca>:
|On 2023-03-02 14:23, Clem Cole wrote:
|> Grant - check out Russ Cox's web page on this very subject: Implementing
|> Regular Expressions
|> <https://streaklinks.com/BaglRcA5OeZepwiX_AELMUI5/https%3A%2F%2Fswtch.co\
|> m%2F%7Ersc%2Fregexp%2F>
|
|Clem, why are you linking through streaklinks.com?
I do not want to be unfriendly; (but) i use firefox-bin
(mozilla-compiled) and my only extension is uMatrix that i have
been pointed to, and i can only recommend it highly to anyone
(though the "modern" web mostly requires to turn off tracking
protection and numerous white flags in uMatrix to work), maybe
even to those who simply put their browser into a container.
Anyhow, uMatrix gives you the following, and while i have not
tried to selectively click me through to get to the target,
i could have done so:
streak.com
www.streak.com
cloudflare.com
cdnjs.cloudflare.com
d3e54v103j8qbb.cloudfront.net
facebook.net
connect.facebook.net
google.com
www.google.com
ajax.googleapis.com
storage.googleapis.com
intercom.io
widget.intercom.io
licdn.com
snap.licdn.com
pdst.fm
cdn.pdst.fm
producthunt.com
api.producthunt.com
sentry-cdn.com
js.sentry-cdn.com
website-files.com
assets.website-files.com
assets-global.website-files.com
google-analytics.com
www.google-analytics.com
googletagmanager.com
www.googletagmanager.com
Randomized links i find just terrible. IETF started using
randomized archive links, which are mesmerising; most often
mailman archive links give you a bit of orientation by themselves,
isn't that more appealing.
--steffen
|
|Der Kragenbaer, The moon bear,
|der holt sich munter he cheerfully and one by one
|einen nach dem anderen runter wa.ks himself off
|(By Robert Gernhardt)
^ permalink raw reply [flat|nested] 46+ messages in thread
* [COFF] Re: Requesting thoughts on extended regular expressions in grep.
2023-03-02 21:53 ` Dan Cross
@ 2023-03-03 1:05 ` Grant Taylor via COFF
2023-03-03 3:04 ` Dan Cross
0 siblings, 1 reply; 46+ messages in thread
From: Grant Taylor via COFF @ 2023-03-03 1:05 UTC (permalink / raw)
To: coff
[-- Attachment #1: Type: text/plain, Size: 5704 bytes --]
On 3/2/23 2:53 PM, Dan Cross wrote:
> Well, obviously the former matches any sequence 3 of
> alpha-numerics/underscores at the beginning of a string, while the
> latter only matches abbreviations of months in the western calendar;
> that is, the two REs are matching very different things (the latter
> is a strict subset of the former).
I completely agree with you. That's also why I'm wanting to start
utilizing the latter, more specific RE. But I don't know where the line
of over complicating things is to avoid crossing it.
> But I suspect you mean in a more general sense.
Yes and no. Does the comment above clarify at all?
> ...do you really want to match a space, a colon and a single digit
> 11 times ...
Yes.
> ... in a single string?
What constitutes a single string? ;-) I sort of rhetorically ask.
The log lines start with
MMM dd hh:mm:ss
Where:
- MMM is the month abbreviation
- dd is the day of the month
- hh is the hour of the day
- mm is the minute of the hour
- ss is the second of the minute
So, yes, there are eleven characters that fall into the class consisting
of a space or a colon or a number.
Is that a single string? It depends what you're looking at, the
sequences of non white space in the log? No. The patter that I'm
matching ya.
> Using character classes would greatly simplify what you're trying to
> do. It seems like this could be simplified to (untested) snippet:
Agreed.
I'm starting with the examples that came with; "^\w{3} [
:[:digit:]]{11}", the logcheck package that I'm working with and
evaluating what I want to do.
I actually like the idea of dividing out the following:
- months that have 31 days: Jan, Mar, May, Jul, Aug, Oct, and Dec
- months that have 30 days: Apr, Jun, Sep, Nov
- month that have 28/29 days: Feb
> ( [1-9]|[12][[0-9]]|3[01]) [0-2][0-9]:[0-5][0-9]:[0-5][0-9]
Aside: Why do you have the double square brackets in "[12][[0-9]]"?
> For this, I'd probably eschew `[:digit:]`. Named character classes
> are for handy locale support, or in lieu of typing every character
> in the alphabet (though we can use ranges to abbreviate that), but
> it kind of seems like that's not coming into play here and, IMHO,
> `[0-9]` is clearer in context.
ACK
"[[:digit:]]+" was a construct that I'm parroting. It and
[.:[:xdigit:]]+ are good for some things. But they definitely aren't
the best for all things.
Hence trying to find the line of being more accurate without going too far.
> It's not clear to me that dates, in their generality, can be
> matched with regular expressions. Consider leap years; you'd almost
> necessarily have to use backtracking for that, but I admit I haven't
> thought it through.
Given the context that these extended regular expressions are going to
be used in, logcheck -- filtering out known okay log entries to email
what doesn't get filtered -- I'm okay with having a few things slip
through like leap day / leap seconds / leap frogs.
> `\w` is a GNU extension; I'd probably avoid it on portability grounds
> (though `\b` is very handy).
I hear, understand, and acknowledge your concern. At present, these
filters are being used in a package; logcheck, which I believe is
specific to Debian and ilk. As such, GNU grep is very much a thing.
I'm also not a fan of the use of `\w` and would prefer to (...|...) things.
> The thing about regular expressions is that they describe regular
> languages, and regular languages are those for which there exists a
> finite automaton that can recognize the language. An important class
> of finite automata are deterministic finite automata; by definition,
> recognition by such automata are linear in the length of the input.
>
> However, construction of a DFA for any given regular expression can be
> superlinear (in fact, it can be exponential) so practically speaking,
> we usually construct non-deterministic finite automata (NDFAs) and
> "simulate" their execution for matching. NDFAs generalize DFAs (DFAs
> are a subset of NDFAs, incidentally) in that, in any non-terminal
> state, there can be multiple subsequent states that the machine can
> transition to given an input symbol. When executed, for any state,
> the simulator will transition to every permissible subsequent state
> simultaneously, discarding impossible states as they become evident.
>
> This implies that NDFA execution is superlinear, but it is bounded,
> and is O(n*m*e), where n is the length of the input, m is the number
> of nodes in the state transition graph corresponding to the NDFA, and
> e is the maximum number of edges leaving any node in that graph (for
> a fully connected graph, that would m, so this can be up to O(n*m^2)).
> Construction of an NDFA is O(m), so while it's slower to execute, it's
> actually possible to construct in a reasonable amount of time. Russ's
> excellent series of articles that Clem linked to gives details and
> algorithms.
I only vaguely understand those three paragraphs as they are deeper
computer science than I've gone before.
I think I get the gist of them but could not explain them if my life
depended upon it.
> In practical terms? Basically, don't worry about it too much. Egrep
> will generate an NDFA simulation that's going to be acceptably fast
> for all but the weirdest cases.
ACK
It sounds like I can make any reasonable extended regular expression a
human can read and I'll probably be good.
Thank you for the detailed response Dan. :-)
--
Grant. . . .
unix || die
[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4017 bytes --]
^ permalink raw reply [flat|nested] 46+ messages in thread
* [COFF] Re: Requesting thoughts on extended regular expressions in grep.
2023-03-02 23:01 ` Stuff Received
2023-03-02 23:46 ` Steffen Nurpmeso
@ 2023-03-03 1:08 ` Grant Taylor via COFF
2023-03-03 2:10 ` Dave Horsfall
1 sibling, 1 reply; 46+ messages in thread
From: Grant Taylor via COFF @ 2023-03-03 1:08 UTC (permalink / raw)
To: coff
[-- Attachment #1: Type: text/plain, Size: 560 bytes --]
On 3/2/23 4:01 PM, Stuff Received wrote:
> Clem, why are you linking through streaklinks.com?
Here's a direct link to the page that I landed on when following Clem's
link:
Link - Implementing Regular Expressions
- https://swtch.com/~rsc/regexp/
I didn't pay attention to Clem's link beyond the fact that I got to the
desired page without needing to tilt at my various filtering plugins.
Though the message I'm replying to has caused a few brain cells to find
themselves in confusion ~> curiosity.
--
Grant. . . .
unix || die
[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4017 bytes --]
^ permalink raw reply [flat|nested] 46+ messages in thread
* [COFF] Re: Requesting thoughts on extended regular expressions in grep.
2023-03-03 1:08 ` Grant Taylor via COFF
@ 2023-03-03 2:10 ` Dave Horsfall
2023-03-03 3:34 ` Grant Taylor via COFF
0 siblings, 1 reply; 46+ messages in thread
From: Dave Horsfall @ 2023-03-03 2:10 UTC (permalink / raw)
To: Computer Old Farts Followers
On Thu, 2 Mar 2023, Grant Taylor via COFF wrote:
> Though the message I'm replying to has caused a few brain cells to find
> themselves in confusion ~> curiosity.
Because evil things can happen with URL redirectors; personally I like to
know where I'm going beforehand...
-- Dave
^ permalink raw reply [flat|nested] 46+ messages in thread
* [COFF] Re: Requesting thoughts on extended regular expressions in grep.
2023-03-03 1:05 ` Grant Taylor via COFF
@ 2023-03-03 3:04 ` Dan Cross
2023-03-03 3:53 ` Grant Taylor via COFF
0 siblings, 1 reply; 46+ messages in thread
From: Dan Cross @ 2023-03-03 3:04 UTC (permalink / raw)
To: Grant Taylor; +Cc: coff
[-- Attachment #1: Type: text/plain, Size: 9408 bytes --]
On Thu, Mar 2, 2023 at 8:06 PM Grant Taylor via COFF <coff@tuhs.org> wrote:
> On 3/2/23 2:53 PM, Dan Cross wrote:
> > Well, obviously the former matches any sequence 3 of
> > alpha-numerics/underscores at the beginning of a string, while the
> > latter only matches abbreviations of months in the western calendar;
> > that is, the two REs are matching very different things (the latter
> > is a strict subset of the former).
>
> I completely agree with you. That's also why I'm wanting to start
> utilizing the latter, more specific RE. But I don't know where the line
> of over complicating things is to avoid crossing it.
I guess what I'm saying is, match what you want to match and don't sweat
the small stuff.
> > But I suspect you mean in a more general sense.
>
> Yes and no. Does the comment above clarify at all?
Not exactly. :-)
What I understand you to mean, based on this and the rest of your note, is
that you want to find a good division point between overly specific,
complex REs and simpler, easy to understand REs that are less specific. The
danger with the latter is that they may match things you don't intend,
while the former are harder to maintain and (arguably) more brittle. I can
sympathize.
> > ...do you really want to match a space, a colon and a single digit
> > 11 times ...
>
> Yes.
>
> > ... in a single string?
>
> What constitutes a single string? ;-) I sort of rhetorically ask.
For the purposes of grep/egrep, that'll be a logical "line" of text,
terminated by a newline, though the newline itself isn't considered part of
the text for matching. I believe the `-z` option can be used to set a NUL
byte as the "line" terminator; presumably this lets one match strings with
embedded newlines, though I haven't tried.
> The log lines start with
>
> MMM dd hh:mm:ss
>
> Where:
> - MMM is the month abbreviation
> - dd is the day of the month
> - hh is the hour of the day
> - mm is the minute of the hour
> - ss is the second of the minute
>
> So, yes, there are eleven characters that fall into the class consisting
> of a space or a colon or a number.
>
> Is that a single string? It depends what you're looking at, the
> sequences of non white space in the log? No. The patter that I'm
> matching ya.
"string" in this context is the input you're attempting to match against.
`egrep` will attempt to match your pattern against each "line" of text it
reads from the files its searching. That is, each line in your log file(s).
But consider what `[ :[:digit:]]{11}` means: you've got a character class
consisting of space, colon and a digit; {11} means "match any of the
characters in that class exactly 11 times" (as opposed to other variations
on the '{}' syntax that say "at least m times", "at most n times", or
"between n and m times"). But that'll match all sorts of things that don't
look like 'dd hh:mm:ss':
term% egrep '[ :[:digit:]]{11}'
11111111111
11111111111
111111111
1111111111111
1111111111111
::::::::::::::::
::::::::::::::::
aaaa bbbbb
aaaa bbbbb
term%
(The first line is my typing; the second is output from egrep except for
the short line of 9 '1's, for which egrep had no output. That last two
lines are matching space characters and egrep echoing the match, but I'm
guessing gmail will eat those.)
Note that there are inputs with more than 11 characters that match; this is
because there is some 11-character substring that matches the RE in those
lines. In any event, I suspect this would generally not be what you want.
But if nothing else in your input can match the RE (which you might know a
priori because of domain knowledge about whatever is generating those logs)
then it's no big deal, even if the RE was capable of matching more things
generally.
> > Using character classes would greatly simplify what you're trying to
> > do. It seems like this could be simplified to (untested) snippet:
>
> Agreed.
>
> I'm starting with the examples that came with; "^\w{3} [
> :[:digit:]]{11}", the logcheck package that I'm working with and
> evaluating what I want to do.
Ah. I suspect this relies on domain knowledge about the format of log lines
to match reliably. Otherwise it could match, `___ 123 456:789` which is
probably not what you are expecting.
> I actually like the idea of dividing out the following:
>
> - months that have 31 days: Jan, Mar, May, Jul, Aug, Oct, and Dec
> - months that have 30 days: Apr, Jun, Sep, Nov
> - month that have 28/29 days: Feb
Sure. One nice thing about `egrep` et al is that you can put the REs into
a file and include them with `-f`, as opposed to having them all directly
on the command line.
> > ( [1-9]|[12][[0-9]]|3[01]) [0-2][0-9]:[0-5][0-9]:[0-5][0-9]
>
> Aside: Why do you have the double square brackets in "[12][[0-9]]"?
Typo. :-)
> > For this, I'd probably eschew `[:digit:]`. Named character classes
> > are for handy locale support, or in lieu of typing every character
> > in the alphabet (though we can use ranges to abbreviate that), but
> > it kind of seems like that's not coming into play here and, IMHO,
> > `[0-9]` is clearer in context.
>
> ACK
>
> "[[:digit:]]+" was a construct that I'm parroting. It and
> [.:[:xdigit:]]+ are good for some things. But they definitely aren't
> the best for all things.
>
> Hence trying to find the line of being more accurate without going too
far.
>
> > It's not clear to me that dates, in their generality, can be
> > matched with regular expressions. Consider leap years; you'd almost
> > necessarily have to use backtracking for that, but I admit I haven't
> > thought it through.
>
> Given the context that these extended regular expressions are going to
> be used in, logcheck -- filtering out known okay log entries to email
> what doesn't get filtered -- I'm okay with having a few things slip
> through like leap day / leap seconds / leap frogs.
That seems reasonable.
> > `\w` is a GNU extension; I'd probably avoid it on portability grounds
> > (though `\b` is very handy).
>
> I hear, understand, and acknowledge your concern. At present, these
> filters are being used in a package; logcheck,
Aside: I found the note on it's website amusing: Brought to you by the UK's
best gambling sites! "Only gamble with what you can afford to lose." Yikes!
> which I believe is
> specific to Debian and ilk. As such, GNU grep is very much a thing.
I'd proceed with caution here; it also seems to be in the FreeBSD and
DragonFly ports collections and Homebrew on the Mac (but so is GNU grep for
all of those).
> I'm also not a fan of the use of `\w` and would prefer to (...|...)
things.
Yeah. IMHO `\w` is too general for what you're trying to do.
> > The thing about regular expressions is that they describe regular
> > languages, and regular languages are those for which there exists a
> > finite automaton that can recognize the language. An important class
> > of finite automata are deterministic finite automata; by definition,
> > recognition by such automata are linear in the length of the input.
> >
> > However, construction of a DFA for any given regular expression can be
> > superlinear (in fact, it can be exponential) so practically speaking,
> > we usually construct non-deterministic finite automata (NDFAs) and
> > "simulate" their execution for matching. NDFAs generalize DFAs (DFAs
> > are a subset of NDFAs, incidentally) in that, in any non-terminal
> > state, there can be multiple subsequent states that the machine can
> > transition to given an input symbol. When executed, for any state,
> > the simulator will transition to every permissible subsequent state
> > simultaneously, discarding impossible states as they become evident.
> >
> > This implies that NDFA execution is superlinear, but it is bounded,
> > and is O(n*m*e), where n is the length of the input, m is the number
> > of nodes in the state transition graph corresponding to the NDFA, and
> > e is the maximum number of edges leaving any node in that graph (for
> > a fully connected graph, that would m, so this can be up to O(n*m^2)).
> > Construction of an NDFA is O(m), so while it's slower to execute, it's
> > actually possible to construct in a reasonable amount of time. Russ's
> > excellent series of articles that Clem linked to gives details and
> > algorithms.
>
> I only vaguely understand those three paragraphs as they are deeper
> computer science than I've gone before.
>
> I think I get the gist of them but could not explain them if my life
> depended upon it.
Basically, a regular expression is a regular expression if you can build a
machine with no additional memory that can tell you whether or not a given
string matches the RE examining its input one character at a time.
> > In practical terms? Basically, don't worry about it too much. Egrep
> > will generate an NDFA simulation that's going to be acceptably fast
> > for all but the weirdest cases.
>
> ACK
>
> It sounds like I can make any reasonable extended regular expression a
> human can read and I'll probably be good.
I think that's about right.
> Thank you for the detailed response Dan. :-)
Sure thing!
- Dan C.
[-- Attachment #2: Type: text/html, Size: 11000 bytes --]
^ permalink raw reply [flat|nested] 46+ messages in thread
* [COFF] Re: Requesting thoughts on extended regular expressions in grep.
2023-03-03 2:10 ` Dave Horsfall
@ 2023-03-03 3:34 ` Grant Taylor via COFF
0 siblings, 0 replies; 46+ messages in thread
From: Grant Taylor via COFF @ 2023-03-03 3:34 UTC (permalink / raw)
To: coff
[-- Attachment #1: Type: text/plain, Size: 299 bytes --]
On 3/2/23 7:10 PM, Dave Horsfall wrote:
> Because evil things can happen with URL redirectors; personally I like to
> know where I'm going beforehand...
I absolutely agree.
The confusion was why someone would purposefully choose to use a redirector.
--
Grant. . . .
unix || die
[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4017 bytes --]
^ permalink raw reply [flat|nested] 46+ messages in thread
* [COFF] Re: Requesting thoughts on extended regular expressions in grep.
2023-03-03 3:04 ` Dan Cross
@ 2023-03-03 3:53 ` Grant Taylor via COFF
2023-03-03 13:47 ` Dan Cross
0 siblings, 1 reply; 46+ messages in thread
From: Grant Taylor via COFF @ 2023-03-03 3:53 UTC (permalink / raw)
To: coff
[-- Attachment #1: Type: text/plain, Size: 5031 bytes --]
On 3/2/23 8:04 PM, Dan Cross wrote:
> I guess what I'm saying is, match what you want to match and don't sweat
> the small stuff.
ACK
> Not exactly. :-)
>
> What I understand you to mean, based on this and the rest of your note,
> is that you want to find a good division point between overly specific,
> complex REs and simpler, easy to understand REs that are less specific.
> The danger with the latter is that they may match things you don't
> intend, while the former are harder to maintain and (arguably) more
> brittle. I can sympathize.
You got it.
> For the purposes of grep/egrep, that'll be a logical "line" of text,
> terminated by a newline, though the newline itself isn't considered part
> of the text for matching. I believe the `-z` option can be used to set a
> NUL byte as the "line" terminator; presumably this lets one match
> strings with embedded newlines, though I haven't tried.
Fair enough. That's also sort of what I thought might be the case.
> "string" in this context is the input you're attempting to match
> against. `egrep` will attempt to match your pattern against each "line"
> of text it reads from the files its searching. That is, each line in
> your log file(s).
*nod*
> But consider what `[ :[:digit:]]{11}` means: you've got a character
> class consisting of space, colon and a digit; {11} means "match any of
> the characters in that class exactly 11 times" (as opposed to other
> variations on the '{}' syntax that say "at least m times", "at most n
> times", or "between n and m times").
Yep, I'm well aware of the that.
> But that'll match all sorts of things that don't look like 'dd
> hh:mm:ss':
That's one of the reasons that I'm interested in coming up with a more
precise regular expression ... without being overly complex.
> (The first line is my typing; the second is output from egrep except for
> the short line of 9 '1's, for which egrep had no output. That last two
> lines are matching space characters and egrep echoing the match, but I'm
> guessing gmail will eat those.)
>
> Note that there are inputs with more than 11 characters that match; this
> is because there is some 11-character substring that matches the RE in
> those lines. In any event, I suspect this would generally not be what
> you want. But if nothing else in your input can match the RE (which you
> might know a priori because of domain knowledge about whatever is
> generating those logs) then it's no big deal, even if the RE was capable
> of matching more things generally.
Yep.
Here's an example of the full RE:
^\w{3} [ :[:digit:]]{11} [._[:alnum:]-]+
postfix/msa/smtpd\[[[:digit:]]+\]: timeout after STARTTLS from
[._[:alnum:]-]+\[[.:[:xdigit:]]+\]$
As you can see the "[ :[:digit:]]{11}" is actually only a sub-part of a
larger RE and there is bounding & delimiting around the subpart.
This is to match a standard message from postfix via standard SYSLOG.
> Ah. I suspect this relies on domain knowledge about the format of log
> lines to match reliably. Otherwise it could match, `___ 123 456:789`
> which is probably not what you are expecting.
Yep.
Though said domain knowledge isn't anything special in and of itself.
> Sure. One nice thing about `egrep` et al is that you can put the REs
> into a file and include them with `-f`, as opposed to having them all
> directly on the command line.
Yep. logcheck makes extensive use of many files like this to do it's work.
> Typo. :-)
ACKK
> That seems reasonable.
Thank you for the logic CRC.
> Aside: I found the note on it's website amusing: Brought to you by the
> UK's best gambling sites! "Only gamble with what you can afford to
> lose." Yikes!
Um ... that's concerning.
> I'd proceed with caution here; it also seems to be in the FreeBSD and
> DragonFly ports collections and Homebrew on the Mac (but so is GNU grep
> for all of those).
Fair enough.
My use case is on Linux where GNU egrep is a thing.
> Yeah. IMHO `\w` is too general for what you're trying to do.
I think that `\w` is a good primer, but not where I want things to end
up long term.
> Basically, a regular expression is a regular expression if you can build
> a machine with no additional memory that can tell you whether or not a
> given string matches the RE examining its input one character at a time.
I /think/ that I could build a complex nested tree of switch statements
to test each character to see if things match what they should or not.
Though I would need at least one variable / memory to hold absolutely
minimal state to know where I am in the switch tree. I think a number
to identify the switch statement in question would be sufficient. So
I'm guessing two bytes of variable and uncounted bytes of program code.
> I think that's about right.
Thank you again Dan.
> Sure thing!
:-)
--
Grant. . . .
unix || die
[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4017 bytes --]
^ permalink raw reply [flat|nested] 46+ messages in thread
* [COFF] Re: Requesting thoughts on extended regular expressions in grep.
2023-03-02 18:54 [COFF] Requesting thoughts on extended regular expressions in grep Grant Taylor via COFF
2023-03-02 19:23 ` [COFF] " Clem Cole
2023-03-02 21:53 ` Dan Cross
@ 2023-03-03 10:59 ` Ralph Corderoy
2023-03-03 13:11 ` Dan Cross
2023-03-03 16:12 ` [COFF] Re: Requesting thoughts on extended regular expressions in grep Dave Horsfall
2023-03-03 19:06 ` Grant Taylor via COFF
2023-03-06 10:01 ` Ed Bradford
4 siblings, 2 replies; 46+ messages in thread
From: Ralph Corderoy @ 2023-03-03 10:59 UTC (permalink / raw)
To: coff
Hi Grant,
> What are the pros / cons to creating extended regular expressions like
> the following:
If you want to understand:
- the maths of regular expressions,
- the syntax of regexps which these days expresses more than REs, and
- the regexp engines in programs, the differences in how they work and
what they match, and
- how to efficiently steer an engine's internals
then I recommend Jeffrey Friedl's Mastering Regular Expressions.
http://regex.info/book.html
> For matching patterns like the following in log files?
>
> Mar 2 03:23:38
Do you want speed of matching with some false positives or validation by
regexp rather than post-lexing logic and to what depth, e.g. does this
month have a ‘31st’?
/^... .. ..:..:../
You'd said egrep, which is NDFA, but in other engines, alternation order
can matter, e.g. ‘J’ starts the most months and some months have more
days than others.
/^(J(an|u[nl])|Ma[ry]|A(ug|pr)|Oct|Dec|...
--
Cheers, Ralph.
^ permalink raw reply [flat|nested] 46+ messages in thread
* [COFF] Re: Requesting thoughts on extended regular expressions in grep.
2023-03-03 10:59 ` Ralph Corderoy
@ 2023-03-03 13:11 ` Dan Cross
2023-03-03 13:42 ` Ralph Corderoy
2023-03-03 16:12 ` [COFF] Re: Requesting thoughts on extended regular expressions in grep Dave Horsfall
1 sibling, 1 reply; 46+ messages in thread
From: Dan Cross @ 2023-03-03 13:11 UTC (permalink / raw)
To: Ralph Corderoy; +Cc: coff
On Fri, Mar 3, 2023 at 5:59 AM Ralph Corderoy <ralph@inputplus.co.uk> wrote:
> [snip]
>
> If you want to understand:
>
> - the maths of regular expressions,
> - the syntax of regexps which these days expresses more than REs, and
> - the regexp engines in programs, the differences in how they work and
> what they match, and
> - how to efficiently steer an engine's internals
>
> then I recommend Jeffrey Friedl's Mastering Regular Expressions.
> http://regex.info/book.html
I'm afraid I must sound a note of caution about Friedl's book. Russ
Cox alludes to some of the problems in the "History and References"
section of his page (https://swtch.com/~rsc/regexp/regexp1.html), that
was linked earlier, and he links to this post:
http://regex.info/blog/2006-09-15/248
The impression is that Friedl shows wonderfully how to _use_ regular
expressions, but does not understand the theory behind their
implementation.
It is certainly true that today what many people refer to as "regular
expressions" are not in fact regular (and require a pushdown automata
to implement, putting them somewhere between REs and the context-free
languages in terms of expressiveness).
Personally, I'd stick with Russ's stuff, especially as `egrep` is the
target here.
- Dan C.
^ permalink raw reply [flat|nested] 46+ messages in thread
* [COFF] Re: Requesting thoughts on extended regular expressions in grep.
2023-03-03 13:11 ` Dan Cross
@ 2023-03-03 13:42 ` Ralph Corderoy
2023-03-03 19:19 ` Grant Taylor via COFF
0 siblings, 1 reply; 46+ messages in thread
From: Ralph Corderoy @ 2023-03-03 13:42 UTC (permalink / raw)
To: coff
Hi Dan,
> > If you want to understand:
> >
> > - the maths of regular expressions,
> > - the syntax of regexps which these days expresses more than REs, and
> > - the regexp engines in programs, the differences in how they work
> > and what they match, and
> > - how to efficiently steer an engine's internals
> >
> > then I recommend Jeffrey Friedl's Mastering Regular Expressions.
> > http://regex.info/book.html
>
> I'm afraid I must sound a note of caution about Friedl's book. Russ
> Cox alludes to some of the problems in the "History and References"
> section of his page (https://swtch.com/~rsc/regexp/regexp1.html), that
> was linked earlier
Russ says:
1 ‘Finally, any discussion of regular expressions would be incomplete
without mentioning Jeffrey Friedl's book Mastering Regular Expressions,
perhaps the most popular reference among today's programmers.
2 Friedl's book teaches programmers how best to use today's regular
expression implementations, but not how best to implement them.
3 What little text it devotes to implementation issues perpetuates the
widespread belief that recursive backtracking is the only way to
simulate an NFA.
4 Friedl makes it clear that he [neither understands nor respects] the
underlying theory.’ http://regex.info/blog/2006-09-15/248
I think Grant is after what Russ addresses in sentence 2. :-)
> The impression is that Friedl shows wonderfully how to _use_ regular
> expressions, but does not understand the theory behind their
> implementation.
Yes, Friedl does show that wonderfully. From long-ago memory, Friedl
understands enough to have diagrams of NFAs and DFAs clocking through
their inputs, showing the differences in number of states, etc.
Yes, Friedl says an NFA must recursively backtrack. As Russ says in #3,
it was a ‘widespread belief’. Friedl didn't originate it; I ‘knew’ it
before reading his book. Friedl was at the sharp end of regexps,
needing to process large amounts of text, at Yahoo! IIRC. He
investigated how the programs available behaved; he didn't start at the
theory and come up with a new program best suited to his needs.
> Personally, I'd stick with Russ's stuff, especially as `egrep` is the
> target here.
Russ's stuff is great. He refuted that widespread belief, for one
thing. But Russ isn't trying to teach a programmer how to best use the
regexp engine in sed, grep, egrep, Perl, PCRE, ... whereas Friedl takes
the many pages needed to do this.
It depends what one wants to learn first.
As Friedl says in the post Russ linked to:
‘As a user, you don't care if it's regular, nonregular, unregular,
irregular, or incontinent. So long as you know what you can expect
from it (something this chapter will show you), you know all you need
to care about.
‘For those wishing to learn more about the theory of regular expressions,
the classic computer-science text is chapter 3 of Aho, Sethi, and
Ullman's Compilers — Principles, Techniques, and Tools (Addison-Wesley,
1986), commonly called “The Dragon Book” due to the cover design.
More specifically, this is the “red dragon”. The “green dragon”
is its predecessor, Aho and Ullman's Principles of Compiler Design.’
In addition to the Dragon Book, Hopcroft and Ullman's ‘Automata Theory,
Languages, and Computation’ goes further into the subject. Chapter two
has DFA, NFA, epsilon transitions, and uses searching text as an
example. Chapter three is regular expressions, four is regular
languages. Pushdown automata is chapter six.
Too many books, not enough time to read. :-)
--
Cheers, Ralph.
^ permalink raw reply [flat|nested] 46+ messages in thread
* [COFF] Re: Requesting thoughts on extended regular expressions in grep.
2023-03-03 3:53 ` Grant Taylor via COFF
@ 2023-03-03 13:47 ` Dan Cross
2023-03-03 19:26 ` Grant Taylor via COFF
0 siblings, 1 reply; 46+ messages in thread
From: Dan Cross @ 2023-03-03 13:47 UTC (permalink / raw)
To: Grant Taylor; +Cc: coff
On Thu, Mar 2, 2023 at 10:53 PM Grant Taylor via COFF <coff@tuhs.org> wrote:
>[snip
> Here's an example of the full RE:
>
> ^\w{3} [ :[:digit:]]{11} [._[:alnum:]-]+
> postfix/msa/smtpd\[[[:digit:]]+\]: timeout after STARTTLS from
> [._[:alnum:]-]+\[[.:[:xdigit:]]+\]$
>
> As you can see the "[ :[:digit:]]{11}" is actually only a sub-part of a
> larger RE and there is bounding & delimiting around the subpart.
Oh, for sure; to be clear, it was obvious that in the earlier
discussion the original was just part of something larger.
FWIW, this RE seems ok to me; the additional context makes it unlikely
to match something else accidentally.
> This is to match a standard message from postfix via standard SYSLOG.
>
> > Ah. I suspect this relies on domain knowledge about the format of log
> > lines to match reliably. Otherwise it could match, `___ 123 456:789`
> > which is probably not what you are expecting.
>
> Yep.
>
> Though said domain knowledge isn't anything special in and of itself.
It needn't be special. The point is simply that there's some external
knowledge that can be brought to bear to guide the shape of the REs.
In this case, you know that log lines won't begin with `___ 123
456:789` or other similar junk.
> [snip]
> > Basically, a regular expression is a regular expression if you can build
> > a machine with no additional memory that can tell you whether or not a
> > given string matches the RE examining its input one character at a time.
>
> I /think/ that I could build a complex nested tree of switch statements
> to test each character to see if things match what they should or not.
> Though I would need at least one variable / memory to hold absolutely
> minimal state to know where I am in the switch tree. I think a number
> to identify the switch statement in question would be sufficient. So
> I'm guessing two bytes of variable and uncounted bytes of program code.
Kinda. The "machine" in this case is actually an abstraction, like a
Turing machine. The salient point here is that REs map to finite state
machines, and in particular, one need not keep (say) a stack of prior
states when simulating them. Note that even in an NDFA simulation,
where one keeps track of what states one may be in, one doesn't need
to keep track of how one got into those states.
Obviously in a real implementation you've got the program counter,
register contents, local variables, etc, all of which consume "memory"
in the conventional sense. But the point is that you don't need
additional memory proportional to anything other than the size of the
RE. DFA implementation could be implemented entirely with `switch` and
`goto` if one wanted, as opposed to a bunch of mutually recursive
function calls, NDFA simulation similarly except that you need some
(bounded) additional memory to hold the active set of states. Contrast
this with a pushdown automata, which can parse a context-free
language, in which a stack is maintained that can store additional
information relative to the input (for example, an already seen
character). Pushdown automata can, for example, recognize matched
parenthesis while regular languages cannot.
Anyway, sorry, this is all rather more theoretical than is perhaps
interesting or useful. Bottom line is, I think your REs are probably
fine. `egrep` will complain at you if they are not, and I wouldn't
worry too much about optimizing them: I'd "stop" whenever you're happy
that you've got something understandable that matches what you want it
to match.
- Dan C.
^ permalink raw reply [flat|nested] 46+ messages in thread
* [COFF] Re: Requesting thoughts on extended regular expressions in grep.
2023-03-03 10:59 ` Ralph Corderoy
2023-03-03 13:11 ` Dan Cross
@ 2023-03-03 16:12 ` Dave Horsfall
2023-03-03 17:13 ` Dan Cross
2023-03-03 19:36 ` Grant Taylor via COFF
1 sibling, 2 replies; 46+ messages in thread
From: Dave Horsfall @ 2023-03-03 16:12 UTC (permalink / raw)
To: Computer Old Farts Followers
[-- Attachment #1: Type: text/plain, Size: 1185 bytes --]
On Fri, 3 Mar 2023, Ralph Corderoy wrote:
> You'd said egrep, which is NDFA, but in other engines, alternation order
> can matter, e.g. ‘J’ starts the most months and some months have more
> days than others.
>
> /^(J(an|u[nl])|Ma[ry]|A(ug|pr)|Oct|Dec|...
I can't help but provide an extract from my antispam log summariser (AWK):
# Yes, I have a warped sense of humour here.
/^[JFMAMJJASOND][aeapauuuecoc][nbrrynlgptvc] [ 0123][0-9] / \
{
date = sprintf("%4d/%.2d/%.2d",
year, months[substr($0, 1, 3)], substr($0, 5, 2))
Etc. The idea is not to validate so much as to grab a line of interest to
me and extract the bits that I want.
In this case I trust the source (the Sendmail log), but of course that is
not always the case...
When doing things like this, you need to ask yourself at least the
following questions:
1) What exactly am I trying to do? This is fairly important :-)
2) Can I trust the data? Bobby Tables, Reflections on Trusting Trust...
3) Etc.
And let's not get started on the difference betwixt "trusted" and
"trustworthy" (that distinction keeps security bods awake at night).
-- Dave
^ permalink raw reply [flat|nested] 46+ messages in thread
* [COFF] Re: Requesting thoughts on extended regular expressions in grep.
2023-03-03 16:12 ` [COFF] Re: Requesting thoughts on extended regular expressions in grep Dave Horsfall
@ 2023-03-03 17:13 ` Dan Cross
2023-03-03 17:38 ` Ralph Corderoy
2023-03-03 19:36 ` Grant Taylor via COFF
1 sibling, 1 reply; 46+ messages in thread
From: Dan Cross @ 2023-03-03 17:13 UTC (permalink / raw)
To: Dave Horsfall; +Cc: Computer Old Farts Followers
On Fri, Mar 3, 2023 at 11:12 AM Dave Horsfall <dave@horsfall.org> wrote:
> [snip]
> # Yes, I have a warped sense of humour here.
> /^[JFMAMJJASOND][aeapauuuecoc][nbrrynlgptvc] [ 0123][0-9] / \
> {
> date = sprintf("%4d/%.2d/%.2d",
> year, months[substr($0, 1, 3)], substr($0, 5, 2))
If I may, I'd like to point out something fairly subtle here that, I
think, bears on the original question (paraphrased as, "where does one
draw the line between concision and understandability?").
Note Dave's class to match the first letter of the month:
`[JFMAMJJASOND]`. One may notice that a few letters are repeated (J,
M, A), and one _could_ shorten this to: `[JFMASOND]`. But I can see a
serious argument where that may be regarded as a mistake; in
particular, the original is easy to validate by just saying the names
of the month out loud as one scans the list. For the shorter version,
I'd worry that I would miss something or make a mistake. The lesson
here is keep it simple and don't over-optimize!
> Etc. The idea is not to validate so much as to grab a line of interest to
> me and extract the bits that I want.
> [snip]
Too true.
A few years ago, Rob Pike gave a talk about lexing in Go that bears on
this that's worth a listen:
https://www.youtube.com/watch?v=HxaD_trXwRE
- Dan C.
^ permalink raw reply [flat|nested] 46+ messages in thread
* [COFF] Re: Requesting thoughts on extended regular expressions in grep.
2023-03-03 17:13 ` Dan Cross
@ 2023-03-03 17:38 ` Ralph Corderoy
2023-03-03 19:09 ` Dan Cross
0 siblings, 1 reply; 46+ messages in thread
From: Ralph Corderoy @ 2023-03-03 17:38 UTC (permalink / raw)
To: coff
Hi,
> Dave Horsfall <dave@horsfall.org> wrote:
> > # Yes, I have a warped sense of humour here.
> > /^[JFMAMJJASOND][aeapauuuecoc][nbrrynlgptvc] [ 0123][0-9] / \
...
> in particular, the original is easy to validate by just saying the
> names of the month out loud as one scans the list.
Some clients pay me to read code and find fault. It's a hard habit to
break. ‘coc’ smells wrong. :-)
A bit of vi's :map later...
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dcc
The regexp works, of course, but in this case removing the redundancy
would also fix the ‘fault’.
--
Cheers, Ralph.
^ permalink raw reply [flat|nested] 46+ messages in thread
* [COFF] Re: Requesting thoughts on extended regular expressions in grep.
2023-03-02 18:54 [COFF] Requesting thoughts on extended regular expressions in grep Grant Taylor via COFF
` (2 preceding siblings ...)
2023-03-03 10:59 ` Ralph Corderoy
@ 2023-03-03 19:06 ` Grant Taylor via COFF
2023-03-03 19:31 ` Dan Cross
2023-03-04 10:07 ` Ralph Corderoy
2023-03-06 10:01 ` Ed Bradford
4 siblings, 2 replies; 46+ messages in thread
From: Grant Taylor via COFF @ 2023-03-03 19:06 UTC (permalink / raw)
To: coff
[-- Attachment #1: Type: text/plain, Size: 1241 bytes --]
Thank you all for very interesting and engaging comments & threads to
chase / pull / untangle.
I'd like to expand / refine my original question a little bit.
On 3/2/23 11:54 AM, Grant Taylor via COFF wrote:
> I'd like some thoughts ~> input on extended regular expressions used
> with grep, specifically GNU grep -e / egrep.
While some reading of the references that Clem provided I came across
multiple indications that back-references can be problematic from a
performance stand point.
So I'd like to know if all back-references are problematic, or if very
specific back-references are okay.
Suppose I have the following two lines:
aaa aaa
aaa bbb
Does the following RE w/ back-reference introduce a big performance penalty?
(aaa|bbb) \1
As in:
% echo "aaa aaa" | egrep "(aaa|bbb) \1"
aaa aaa
I can easily see how a back reference to something that is not a fixed
length can become a rabbit hole. But I'm wondering if a back-reference
to -- what I think is called -- an alternation (with length fixed in the
RE) is a performance hit or not.
Now to read and reply to the many good comments that people have shared.
:-)
--
Grant. . . .
unix || die
[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4017 bytes --]
^ permalink raw reply [flat|nested] 46+ messages in thread
* [COFF] Re: Requesting thoughts on extended regular expressions in grep.
2023-03-03 17:38 ` Ralph Corderoy
@ 2023-03-03 19:09 ` Dan Cross
0 siblings, 0 replies; 46+ messages in thread
From: Dan Cross @ 2023-03-03 19:09 UTC (permalink / raw)
To: Ralph Corderoy; +Cc: coff
On Fri, Mar 3, 2023 at 12:39 PM Ralph Corderoy <ralph@inputplus.co.uk> wrote:
> > Dave Horsfall <dave@horsfall.org> wrote:
> > > # Yes, I have a warped sense of humour here.
> > > /^[JFMAMJJASOND][aeapauuuecoc][nbrrynlgptvc] [ 0123][0-9] / \
> ...
> > in particular, the original is easy to validate by just saying the
> > names of the month out loud as one scans the list.
>
> Some clients pay me to read code and find fault. It's a hard habit to
> break. ‘coc’ smells wrong. :-)
>
> A bit of vi's :map later...
>
> Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dcc
>
> The regexp works, of course, but in this case removing the redundancy
> would also fix the ‘fault’.
Ha! Good catch. I'd probably just write it as,
`(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)` which isn't much
longer than the original anyway.
- Dan C.
^ permalink raw reply [flat|nested] 46+ messages in thread
* [COFF] Re: Requesting thoughts on extended regular expressions in grep.
2023-03-03 13:42 ` Ralph Corderoy
@ 2023-03-03 19:19 ` Grant Taylor via COFF
2023-03-04 10:15 ` [COFF] Reading PDFs on a mobile. (Was: Requesting thoughts on extended regular expressions in grep.) Ralph Corderoy
0 siblings, 1 reply; 46+ messages in thread
From: Grant Taylor via COFF @ 2023-03-03 19:19 UTC (permalink / raw)
To: coff
[-- Attachment #1: Type: text/plain, Size: 3590 bytes --]
On 3/3/23 6:42 AM, Ralph Corderoy wrote:
> I think Grant is after what Russ addresses in sentence 2. :-)
You are mostly correct. The motivation for this thread is very much so
wanting to learn "how best to use today's regular expression
implementations". However there is also the part of me that wants to
have a little bit of understanding behind why the former is the case.
> Yes, Friedl does show that wonderfully. From long-ago memory, Friedl
> understands enough to have diagrams of NFAs and DFAs clocking through
> their inputs, showing the differences in number of states, etc.
It seems like I need to find another copy of Friedl's book. -- My
current copy is boxed up for a move nearly 1k miles away. :-/
> Yes, Friedl says an NFA must recursively backtrack. As Russ says in #3,
> it was a ‘widespread belief’. Friedl didn't originate it; I ‘knew’ it
> before reading his book. Friedl was at the sharp end of regexps,
> needing to process large amounts of text, at Yahoo! IIRC. He
> investigated how the programs available behaved; he didn't start at the
> theory and come up with a new program best suited to his needs.
It sounds like I'm coming from a similar position of "what is the best*
way to process this corpus" more than "what is the underlying theory
behind what I'm wanting to do".
> Russ's stuff is great. He refuted that widespread belief, for one
> thing. But Russ isn't trying to teach a programmer how to best use the
> regexp engine in sed, grep, egrep, Perl, PCRE, ... whereas Friedl takes
> the many pages needed to do this.
:-)
> It depends what one wants to learn first.
I'm learning that I'm more of a technician that wants to know how to use
the existing tools to the best of his / their ability. While having
some interest in theory behind things.
> As Friedl says in the post Russ linked to:
>
> ‘As a user, you don't care if it's regular, nonregular, unregular,
> irregular, or incontinent. So long as you know what you can expect
> from it (something this chapter will show you), you know all you need
> to care about.
Yep. That's the position that I would be in if someone were paying me
to write the REs that I'm writing.
> ‘For those wishing to learn more about the theory of regular expressions,
> the classic computer-science text is chapter 3 of Aho, Sethi, and
> Ullman's Compilers — Principles, Techniques, and Tools (Addison-Wesley,
> 1986), commonly called “The Dragon Book” due to the cover design.
> More specifically, this is the “red dragon”. The “green dragon”
> is its predecessor, Aho and Ullman's Principles of Compiler Design.’
This all sounds interesting to me, and like something I might add to my
collection of books. But it also sounds like something that will be an
up hill read and vast learning opportunity.
> In addition to the Dragon Book, Hopcroft and Ullman's ‘Automata Theory,
> Languages, and Computation’ goes further into the subject. Chapter two
> has DFA, NFA, epsilon transitions, and uses searching text as an
> example. Chapter three is regular expressions, four is regular
> languages. Pushdown automata is chapter six.
>
> Too many books, not enough time to read. :-)
Yep. Even inventorying and keeping track of the books can be time
consuming. -- Thankfully I took some time to do exactly that and have
access to that information on the super computer in my pocket.
--
Grant. . . .
unix || die
[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4017 bytes --]
^ permalink raw reply [flat|nested] 46+ messages in thread
* [COFF] Re: Requesting thoughts on extended regular expressions in grep.
2023-03-03 13:47 ` Dan Cross
@ 2023-03-03 19:26 ` Grant Taylor via COFF
0 siblings, 0 replies; 46+ messages in thread
From: Grant Taylor via COFF @ 2023-03-03 19:26 UTC (permalink / raw)
To: coff
[-- Attachment #1: Type: text/plain, Size: 3045 bytes --]
On 3/3/23 6:47 AM, Dan Cross wrote:
> Oh, for sure; to be clear, it was obvious that in the earlier
> discussion the original was just part of something larger.
Good. For a moment I thought that you might be thinking it was stand alone.
> FWIW, this RE seems ok to me; the additional context makes it unlikely
> to match something else accidentally.
:-)
> It needn't be special. The point is simply that there's some external
> knowledge that can be brought to bear to guide the shape of the REs.
ACK
I've heard "domain (specific) knowledge" used to refer to both extremely
specific training in a field and -- as you have -- data that is having
something done to it.
> In this case, you know that log lines won't begin with `___ 123
> 456:789` or other similar junk.
They darned well had better not.
> Kinda. The "machine" in this case is actually an abstraction, like a
> Turing machine. The salient point here is that REs map to finite state
> machines, and in particular, one need not keep (say) a stack of prior
> states when simulating them. Note that even in an NDFA simulation,
> where one keeps track of what states one may be in, one doesn't need
> to keep track of how one got into those states.
ACK
> Obviously in a real implementation you've got the program counter,
> register contents, local variables, etc, all of which consume
> "memory" in the conventional sense. But the point is that you don't
> need additional memory proportional to anything other than the size
> of the RE. DFA implementation could be implemented entirely with
> `switch` and `goto` if one wanted, as opposed to a bunch of mutually
> recursive function calls, NDFA simulation similarly except that
> you need some (bounded) additional memory to hold the active set
> of states. Contrast this with a pushdown automata, which can parse
> a context-free language, in which a stack is maintained that can
> store additional information relative to the input (for example,
> an already seen character). Pushdown automata can, for example,
> recognize matched parenthesis while regular languages cannot.
I think I understand the gist of what you're saying, but I need to
re-read it and think about it a little bit.
> Anyway, sorry, this is all rather more theoretical than is perhaps
> interesting or useful.
Apology returned to sender as unnecessary.
You are providing the requested thought provoking discussion, which is
exactly what I asked for. I feel like I'm going to walk away from this
thread wiser based on the thread's content plus all additional reading
material on top of the thread itself.
> Bottom line is, I think your REs are probably fine. `egrep` will
> complain at you if they are not, and I wouldn't worry too much about
> optimizing them: I'd "stop" whenever you're happy that you've got
> something understandable that matches what you want it to match.
Thank you (again) Dan. :-)
--
Grant. . . .
unix || die
[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4017 bytes --]
^ permalink raw reply [flat|nested] 46+ messages in thread
* [COFF] Re: Requesting thoughts on extended regular expressions in grep.
2023-03-03 19:06 ` Grant Taylor via COFF
@ 2023-03-03 19:31 ` Dan Cross
2023-03-04 10:07 ` Ralph Corderoy
1 sibling, 0 replies; 46+ messages in thread
From: Dan Cross @ 2023-03-03 19:31 UTC (permalink / raw)
To: Grant Taylor; +Cc: coff
On Fri, Mar 3, 2023 at 2:06 PM Grant Taylor via COFF <coff@tuhs.org> wrote:
> Thank you all for very interesting and engaging comments & threads to
> chase / pull / untangle.
>
> I'd like to expand / refine my original question a little bit.
>
> On 3/2/23 11:54 AM, Grant Taylor via COFF wrote:
> > I'd like some thoughts ~> input on extended regular expressions used
> > with grep, specifically GNU grep -e / egrep.
>
> While some reading of the references that Clem provided I came across
> multiple indications that back-references can be problematic from a
> performance stand point.
>
> So I'd like to know if all back-references are problematic, or if very
> specific back-references are okay.
The thing about backreferences is that they're not representable in
the regular languages because they require additional state (the thing
the backref refers to), so you cannot construct a DFA corresponding to
them, nor an NDFA simulator (this is where Freidl gets things wrong!);
you really need a pushdown automata and then you're in the domain of
the context-free languages. Therefore, "regexps" that use back
references are not actually regular expressions.
Yet, popular engines support them...but how? Well, pretty much all of
them use a backtracking implementation, which _can_ be exponential in
both time and space.
Now, that said, there are plenty of REs, even some with backrefs,
that'll execute plenty fast enough on backtracking implementations; it
really depends on the expressions in question and the size of strings
you're trying to match against. But you lose the bounding guarantees
DFAs and NDFAs provide.
> Suppose I have the following two lines:
>
> aaa aaa
> aaa bbb
>
> Does the following RE w/ back-reference introduce a big performance penalty?
>
> (aaa|bbb) \1
>
> As in:
>
> % echo "aaa aaa" | egrep "(aaa|bbb) \1"
> aaa aaa
>
> I can easily see how a back reference to something that is not a fixed
> length can become a rabbit hole. But I'm wondering if a back-reference
> to -- what I think is called -- an alternation (with length fixed in the
> RE) is a performance hit or not.
Well, it's more about the implementation strategy than the specific
expression here. Could this become exponential? I don't think this one
would, no; but others may, particularly if you use Kleene closures in
the alternation.
This _is_ something that appears in the wild, by the way, not just in
theory; I did a change to Google's spelling service code to replace
PCRE with re2 precisely because it was blowing up with exponential
memory usage on some user input. The problems went away, but I had to
rewrite a bunch of the REs involved.
- Dan C.
^ permalink raw reply [flat|nested] 46+ messages in thread
* [COFF] Re: Requesting thoughts on extended regular expressions in grep.
2023-03-03 16:12 ` [COFF] Re: Requesting thoughts on extended regular expressions in grep Dave Horsfall
2023-03-03 17:13 ` Dan Cross
@ 2023-03-03 19:36 ` Grant Taylor via COFF
2023-03-04 10:26 ` Ralph Corderoy
1 sibling, 1 reply; 46+ messages in thread
From: Grant Taylor via COFF @ 2023-03-03 19:36 UTC (permalink / raw)
To: coff
[-- Attachment #1: Type: text/plain, Size: 2055 bytes --]
On 3/3/23 9:12 AM, Dave Horsfall wrote:
> I can't help but provide an extract from my antispam log summariser
> (AWK):
>
> # Yes, I have a warped sense of humour here.
> /^[JFMAMJJASOND][aeapauuuecoc][nbrrynlgptvc] [ 0123][0-9] / \
> {
> date = sprintf("%4d/%.2d/%.2d",
> year, months[substr($0, 1, 3)], substr($0, 5, 2))
Thank you for sharing that Dave.
> Etc. The idea is not to validate so much as to grab a line of interest
> to me and extract the bits that I want.
Fair enough.
Using bracket expressions for the three letters is definitely another
idea that I hadn't considered.
But I believe I like what I think is -- what I'm going to describe as --
the more precise alternation listing out each month. (Jan|Feb|Mar...
Such an alternation is not going to match Jer like the three bracket
expressions will. I also believe that the alternation will be easier to
maintain in the future. Especially by someone other than me that has
less experience with REs.
> In this case I trust the source (the Sendmail log), but of course
> that is not always the case...
I trust that syslog will produce consistent line beginnings more than I
trust the data that is provided to syslog. But I'd still like to be
able to detect "Jer" or "Dot" if syslog ever tosses it's cookies.
> When doing things like this, you need to ask yourself at least the
> following questions:
>
> 1) What exactly am I trying to do? This is fairly important :-)
Filter out known to be okay log entries.
> 2) Can I trust the data? Bobby Tables, Reflections on Trusting
> Trust...
Given that I'm effectively negating things and filtering out log entries
that I want to not see (because they are okay) I'm comfortable with
trusting the data from syslog.
Brown M&Ms come to mind.
> 3) Etc.
>
> And let's not get started on the difference betwixt "trusted" and
> "trustworthy" (that distinction keeps security bods awake at night).
ACK
--
Grant. . . .
unix || die
[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4017 bytes --]
^ permalink raw reply [flat|nested] 46+ messages in thread
* [COFF] Re: Requesting thoughts on extended regular expressions in grep.
2023-03-03 19:06 ` Grant Taylor via COFF
2023-03-03 19:31 ` Dan Cross
@ 2023-03-04 10:07 ` Ralph Corderoy
1 sibling, 0 replies; 46+ messages in thread
From: Ralph Corderoy @ 2023-03-04 10:07 UTC (permalink / raw)
To: coff
Hi Grant,
> Suppose I have the following two lines:
>
> aaa aaa
> aaa bbb
>
> Does the following RE w/ back-reference introduce a big performance
> penalty?
>
> (aaa|bbb) \1
>
> As in:
>
> % echo "aaa aaa" | egrep "(aaa|bbb) \1"
> aaa aaa
You could measure the number of CPU instructions and experiment.
$ echo xyzaaa aaaxyz >f
$ ticks() { LC_ALL=C perf stat -e instructions egrep "$@"; }
$
$ ticks '(aaa|bbb) \1' <f
xyzaaa aaaxyz
Performance counter stats for 'egrep (aaa|bbb) \1':
2790889 instructions:u
0.009146904 seconds time elapsed
0.009178000 seconds user
0.000000000 seconds sys
$
Bear in mind that egreps differ, even within GNU egrep, say, over time.
$ LC_ALL=C perf stat -e instructions egrep '(aaa|bbb) \1' f
xyzaaa aaaxyz
...
2795836 instructions:u
...
$ LC_ALL=C perf stat -e instructions perl -ne '/(aaa|bbb) \1/ and print' f
xyzaaa aaaxyz
...
2563488 instructions:u
...
$ LC_ALL=C perf stat -e instructions sed -nr '/(aaa|bbb) \1/p' f
xyzaaa aaaxyz
...
610213 instructions:u
...
$
--
Cheers, Ralph.
^ permalink raw reply [flat|nested] 46+ messages in thread
* [COFF] Reading PDFs on a mobile. (Was: Requesting thoughts on extended regular expressions in grep.)
2023-03-03 19:19 ` Grant Taylor via COFF
@ 2023-03-04 10:15 ` Ralph Corderoy
2023-03-07 21:49 ` [COFF] " Tomasz Rola
2023-06-20 16:02 ` Michael Parson
0 siblings, 2 replies; 46+ messages in thread
From: Ralph Corderoy @ 2023-03-04 10:15 UTC (permalink / raw)
To: coff
Hi,
Grant wrote:
> Even inventorying and keeping track of the books can be time
> consuming. -- Thankfully I took some time to do exactly that and
> have access to that information on the super computer in my pocket.
I seek recommendations for an Android app to comfortably read PDFs on a
mobile phone's screen. They were intended to be printed as a book. In
particular, once I've zoomed and panned to get the interesting part of a
page as large as possible, swiping between pages should persist that
view. An extra point for allowing odd and even pages to use different
panning.
--
Cheers, Ralph.
^ permalink raw reply [flat|nested] 46+ messages in thread
* [COFF] Re: Requesting thoughts on extended regular expressions in grep.
2023-03-03 19:36 ` Grant Taylor via COFF
@ 2023-03-04 10:26 ` Ralph Corderoy
0 siblings, 0 replies; 46+ messages in thread
From: Ralph Corderoy @ 2023-03-04 10:26 UTC (permalink / raw)
To: coff
Hi Grant,
> the more precise alternation listing out each month. (Jan|Feb|Mar...
For those regexp engines which test each alternative in turn, ordering
the months most-frequent first would give a slight win. :-) It really
is a rabbit hole once you start. Typically not worth entering, but it
can be fun if you like that kind of thing.
> I trust that syslog will produce consistent line beginnings more than
> I trust the data that is provided to syslog. But I'd still like to be
> able to detect "Jer" or "Dot" if syslog ever tosses it's cookies.
You could develop your regexps to find lines of interest and then flip
them about, e.g. egrep's -v, to see what lines are missed and consider
if any are interesting. Repeat. But this happens at development time.
Or at run time, you can have a ‘loose’ regexp to let all expected lines
in through the door and then match with one or more ‘tight’ regexps,
baulking if none do.
There's no right answer in general.
--
Cheers, Ralph.
^ permalink raw reply [flat|nested] 46+ messages in thread
* [COFF] Re: Requesting thoughts on extended regular expressions in grep.
2023-03-02 18:54 [COFF] Requesting thoughts on extended regular expressions in grep Grant Taylor via COFF
` (3 preceding siblings ...)
2023-03-03 19:06 ` Grant Taylor via COFF
@ 2023-03-06 10:01 ` Ed Bradford
2023-03-06 21:01 ` Dan Cross
4 siblings, 1 reply; 46+ messages in thread
From: Ed Bradford @ 2023-03-06 10:01 UTC (permalink / raw)
To: Grant Taylor; +Cc: COFF
[-- Attachment #1: Type: text/plain, Size: 5314 bytes --]
Thanks, Grant and contributors in
this thread,
Great thread on RE's. I bought and read
the book (it's on the floor over there
in the corner and I'm not getting up).
My task was finding dates in binary
and text files. It turns out RE's work just
fine for that. Because I was looking at
both text files and binary files, I
wrote my stuff using 8-bit python
"bytes" rather than python "text" which
is, I think, 7-bit in python. (I use
python because it works on both
Linux, Macs and Windows and reduces the
number of RE implementations I have
to deal with to 1).
I finished my first round of the
program late fall of 2022. Then
I put it down and now I am
revisiting it. I was creating:
A Python program to search for
media files (pictures and movies)
and copy them to another
directory tree, copying only the
unique ones (deduplication), and
renaming each with
*YYYY-MM-DD-*
as a prefix.
Here is a list of observations from my
programming.
1. RE's are quite unreadable. I defined
a lot of python variables and simply
added them together in python to make
a larger byte string (see below).
The resulting
expressions were shorter on screen
and more readable. Furthermore,
I could construct them incrementally.
I insist on readable code
because I frequently put things down
for a month or more. A while back
it was a sad day when I restarted
something and simply had to throw it
away, moaning, "What was that
programmer thinking?".
Here is an example RE for
YYYY-MM-DD
# FR = front BA = back
# ymdt is text version
ymdt = FRSEP + Y_ + SEP + M_ + SEP + D_ + BASEP
ymdc = re.compile( ymdt )
1a. I also had a time defining
delimiters. There are delimiters
for the beginning, delimiters
for internal separation,
and delimiters for the end.
The significant thing is I have
to find the RE if it is the very
first string in the file or the
very last. That also complicates
buffered reading immensely. Hence, I wrote
the whole program by reading the
file into a single python variable.
However, when files become much
larger than memory, python simply
ground to a halt as did my Windows
machine. I then rewrote it using a
memory mapped file (for all files)
and the problem was fixed.
2. Dates are formatted in a number of
ways. I chose exactly one
format to learn about RE's
and how to construct them and use
them. Even the book didn't elaborate
everything. I could not find
detailed documentation on some of
the interfaces in the book.
On a whim, I asked chatGPT
to write a python module that returns
a list of offsets and dates in a file.
Surprisingly, it wrote one that was
quite credible. It had bugs but it
knew more about how to use the various
functional interfaces in RE's than I
did.
3. Testing an RE is maybe even more
difficult than writing one. I have
not given any serious effort to
verification testing yet.
I would like to extend my program to
any date format. That would require
a much bigger RE. I have been led to
believe that a 50Kbyte or 500Kbyte
RE works just as well (if not
as fast) as a 100 byte RE. I think
with parentheses and
pipe-symbols suitably used,
one could match
Monday, March 6, 2023
2023-03-06
Mar 6, 2023
or
...
I'm just guessing, though. This
thread has been very informative.
I have much to read.
Thank all of you.
Ed Bradford
Pflugerville, TX
On Thu, Mar 2, 2023 at 12:55 PM Grant Taylor via COFF <coff@tuhs.org> wrote:
> Hi,
>
> I'd like some thoughts ~> input on extended regular expressions used
> with grep, specifically GNU grep -e / egrep.
>
> What are the pros / cons to creating extended regular expressions like
> the following:
>
> ^\w{3}
>
> vs:
>
> ^(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)
>
> Or:
>
> [ :[:digit:]]{11}
>
> vs:
>
> ( 1| 2| 3| 4| 5| 6| 7| 8|
> 9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31)
> (0|1|2)[[:digit:]]:(0|1|2|3|4|5)[[:digit:]]:(0|1|2|3|4|5)[[:digit:]]
>
> I'm currently eliding the 61st (60) second, the 32nd day, and dealing
> with February having fewer days for simplicity.
>
> For matching patterns like the following in log files?
>
> Mar 2 03:23:38
>
> I'm working on organically training logcheck to match known good log
> entries. So I'm *DEEP* in the bowels of extended regular expressions
> (GNU egrep) that runs over all logs hourly. As such, I'm interested in
> making sure that my REs are both efficient and accurate or at least not
> WILDLY badly structured. The pedantic part of me wants to avoid
> wildcard type matches (\w), even if they are bounded (\w{3}), unless it
> truly is for unpredictable text.
>
> I'd appreciate any feedback and recommendations from people who have
> been using and / or optimizing (extended) regular expressions for longer
> than I have been using them.
>
> Thank you for your time and input.
>
>
>
> --
> Grant. . . .
> unix || die
>
>
--
Advice is judged by results, not by intentions.
Cicero
[-- Attachment #2: Type: text/html, Size: 6600 bytes --]
^ permalink raw reply [flat|nested] 46+ messages in thread
* [COFF] Re: Requesting thoughts on extended regular expressions in grep.
2023-03-06 10:01 ` Ed Bradford
@ 2023-03-06 21:01 ` Dan Cross
2023-03-06 21:49 ` Steffen Nurpmeso
` (2 more replies)
0 siblings, 3 replies; 46+ messages in thread
From: Dan Cross @ 2023-03-06 21:01 UTC (permalink / raw)
To: Ed Bradford; +Cc: Grant Taylor, COFF
On Mon, Mar 6, 2023 at 5:02 AM Ed Bradford <egbegb2@gmail.com> wrote:
>[snip]
> I would like to extend my program to
> any date format. That would require
> a much bigger RE. I have been led to
> believe that a 50Kbyte or 500Kbyte
> RE works just as well (if not
> as fast) as a 100 byte RE. I think
> with parentheses and
> pipe-symbols suitably used,
> one could match
>
> Monday, March 6, 2023
> 2023-03-06
> Mar 6, 2023
> or
> ...
This reminds me of something that I wanted to bring up.
Perhaps one _could_ define a sufficiently rich regular expression that
one could match a number of date formats. However, I submit that one
_should not_. REs may be sufficiently powerful, but in all likelihood
what you'll end up with is an unreadable mess; it's like people who
abuse `sed` or whatever to execute complex, general purpose programs:
yeah, it's a clever hack, but that doesn't mean you should do it.
Pick the right tool for the job. REs are a powerful tool, but they're
not the right tool for _every_ job, and I'd argue that once you hit a
threshold of complexity that'll be mostly self-evident, it's time to
move on to something else.
As for large vs small REs.... When we start talking about differences
of orders of magnitude in size, we start talking about real
performance implications; in general an NDFA simulation of a regular
expression will have on the order of the length of the RE in states,
so when the length of the RE is half a million symbols, that's
half-a-million states, which practically speaking is a pretty big
number, even though it's bounded is still a pretty big number, and
even on modern CPUs.
I wouldn't want to poke that bear.
- Dan C.
^ permalink raw reply [flat|nested] 46+ messages in thread
* [COFF] Re: Requesting thoughts on extended regular expressions in grep.
2023-03-06 21:01 ` Dan Cross
@ 2023-03-06 21:49 ` Steffen Nurpmeso
2023-03-07 1:43 ` Larry McVoy
2023-03-07 4:19 ` Ed Bradford
2 siblings, 0 replies; 46+ messages in thread
From: Steffen Nurpmeso @ 2023-03-06 21:49 UTC (permalink / raw)
To: Dan Cross; +Cc: Grant Taylor, COFF
Dan Cross wrote in
<CAEoi9W6tZ+55MSPxPoZqfS3k9RO9MOQqB0yu=MO_vzzw0K6Lhw@mail.gmail.com>:
|On Mon, Mar 6, 2023 at 5:02 AM Ed Bradford <egbegb2@gmail.com> wrote:
|>[snip]
|> I would like to extend my program to
|> any date format. That would require
|> a much bigger RE. I have been led to
...
|> one could match
|>
|> Monday, March 6, 2023
|> 2023-03-06
|> Mar 6, 2023
|> or
...
|This reminds me of something that I wanted to bring up.
Me too. If it becomes something regular and stable maybe turn
into a dedicated parser. (As a lex yacc bison byacc refuser, but
these surely can too.)
|Perhaps one _could_ define a sufficiently rich regular expression that
|one could match a number of date formats. However, I submit that one
|_should not_. REs may be sufficiently powerful, but in all likelihood
...
Kurt Shoens implemented some date template parser for BSD Mail in
about 1980 that was successively changed many years later by
Edward Wang in 1988 ([1] commit
[309eb459e35f77985851ce143ad2f9da5f0d90da], 1988-07-08 18:41:33
-0800). There is strftime(3), but it came later than both to
CSRG, and the Wang thing (in usr.bin/mail/head.c) is a dedicated
thing.
(Ie
/* Template characters for cmatch_data.tdata:
* 'A' An upper case char
* 'a' A lower case char
* ' ' A space
* '0' A digit
* 'O' An optional digit or space; MUST be followed by '0space'!
* ':' A colon
* '+' Either a plus or a minus sign */
and then according strings like "Aaa Aaa O0 00:00:00 0000".)
[1] https://github.com/robohack/ucb-csrg-bsd.git
--steffen
|
|Der Kragenbaer, The moon bear,
|der holt sich munter he cheerfully and one by one
|einen nach dem anderen runter wa.ks himself off
|(By Robert Gernhardt)
^ permalink raw reply [flat|nested] 46+ messages in thread
* [COFF] Re: Requesting thoughts on extended regular expressions in grep.
2023-03-06 21:01 ` Dan Cross
2023-03-06 21:49 ` Steffen Nurpmeso
@ 2023-03-07 1:43 ` Larry McVoy
2023-03-07 4:01 ` Ed Bradford
2023-03-07 4:19 ` Ed Bradford
2 siblings, 1 reply; 46+ messages in thread
From: Larry McVoy @ 2023-03-07 1:43 UTC (permalink / raw)
To: Dan Cross; +Cc: Grant Taylor, COFF
On Mon, Mar 06, 2023 at 04:01:51PM -0500, Dan Cross wrote:
> On Mon, Mar 6, 2023 at 5:02???AM Ed Bradford <egbegb2@gmail.com> wrote:
> >[snip]
> > I would like to extend my program to
> > any date format. That would require
> > a much bigger RE. I have been led to
> > believe that a 50Kbyte or 500Kbyte
> > RE works just as well (if not
> > as fast) as a 100 byte RE. I think
> > with parentheses and
> > pipe-symbols suitably used,
> > one could match
> >
> > Monday, March 6, 2023
> > 2023-03-06
> > Mar 6, 2023
> > or
> > ...
>
> This reminds me of something that I wanted to bring up.
>
> Perhaps one _could_ define a sufficiently rich regular expression that
> one could match a number of date formats. However, I submit that one
> _should not_. REs may be sufficiently powerful, but in all likelihood
> what you'll end up with is an unreadable mess; it's like people who
> abuse `sed` or whatever to execute complex, general purpose programs:
> yeah, it's a clever hack, but that doesn't mean you should do it.
Dan, I agree with you. I ran a software company for almost 20 years
and the main thing I contributed was "lets be dumb". Lets write code
that is easy to read, easy to bug fix.
Smart engineers love to be clever, they would be the folks that wrote
those long RE that worked magic. But that magic was something they
understood and nobody else did.
Less is more. Less is easy to support.
^ permalink raw reply [flat|nested] 46+ messages in thread
* [COFF] Re: Requesting thoughts on extended regular expressions in grep.
2023-03-07 1:43 ` Larry McVoy
@ 2023-03-07 4:01 ` Ed Bradford
2023-03-07 11:39 ` [COFF] " Ralph Corderoy
2023-03-07 16:14 ` Dan Cross
0 siblings, 2 replies; 46+ messages in thread
From: Ed Bradford @ 2023-03-07 4:01 UTC (permalink / raw)
To: Larry McVoy; +Cc: Grant Taylor, COFF
[-- Attachment #1: Type: text/plain, Size: 2560 bytes --]
I have made an attempt to make my RE stuff readable and supportable. I
think I write more description that I do RE "code". As for, *it won't be
comprehendable,* Machine language
was unreadable and then along came assembly language. Assembly language was
unreadable, then came higher level languages. Even higher level languages
are unsupportable if not well documented and mostly simple to understand
("you are not expected to understand this" notwithstanding). The jump from
machine language to python today
was unimagined in early times.
[
As an old timer, I see inflection points
between:
machine language and assembly language
assembly language and high level languages
and
high level languages and python.
But that's just me.
]
I think it is possible to make a 50K RE that is understandable. However, it
requires
a lot of 'splainin' throughout the code. I'm naive though; I will
eventually discover
a lack of truth in that belief, if such exists.
I repeat. I put stuff down for months at a time. My metric is *coming back
to it*
*and understanding where I left off*. So far, I can do that for this RE
program that
works for small files, large files,
binary files and text files for exactly one pattern:
YYYY[-MM-DD]
I constructed this RE with code like this:
# ymdt is YYYY-MM-DD RE in text.
# looking only for 1900s and 2000s years and no later than today.
_YYYY = "(19\d\d|20[01]\d|202" + "[0-" + lastYearRE) + "]" + "){1}"
# months
_MM = "(0[1-9]|1[012])"
# days
_DD = "(0[1-9]|[12]\d|3[01])"
ymdt = _YYYY + '[' + _INTERNALSEP +
_MM +
_INTERNALSEP +
']'{0,1)
For the whole file, RE I used
ymdthf = _FRSEP + ymdt + _BASEP
where FRSEP is front separator which includes
a bunch of possible separators, excluding numbers and letters, or-ed
with the up arrow "beginning of line" RE mark. BASEP is back separator
is same as FRSEP with "^" replaced with "$".
I then aimed ymdthf at "data" the thing that represents
the entire memory mapped file (where there is only one beginning
and one end).
Again, I say validating an RE is as difficult or more than writing one.
What does it miss?
Dates are an excellent test ground for RE's. Latitude and longitude is
another.
Ed
PS: I thought I was on the COFF mailing list. I received this email
by direct mail to from Larry. I haven't seen any other comments
on my submission. I might have unsubscribed, but now I regret it. Dear
powers
that be: Please resubscribe me.
[-- Attachment #2: Type: text/html, Size: 8375 bytes --]
^ permalink raw reply [flat|nested] 46+ messages in thread
* [COFF] Re: Requesting thoughts on extended regular expressions in grep.
2023-03-06 21:01 ` Dan Cross
2023-03-06 21:49 ` Steffen Nurpmeso
2023-03-07 1:43 ` Larry McVoy
@ 2023-03-07 4:19 ` Ed Bradford
2 siblings, 0 replies; 46+ messages in thread
From: Ed Bradford @ 2023-03-07 4:19 UTC (permalink / raw)
To: Dan Cross; +Cc: Grant Taylor, COFF
[-- Attachment #1: Type: text/plain, Size: 2509 bytes --]
Hi Dan,
It sounds to me like an "optimizer" is needed. There is alreay a compiler
that uses FA's. Is someone else going to create a program
to look for dates without using regular expressions?
Today, I write small-sized RE's. If I write a giant RE, there is nothing
preventing
the owner of RE world to change how they are used. For instance. Compile
your RE
and a subroutine/function is produced that performs the RE search.
RE is a *language*, not necessarily an implementation.
At least that is my understanding.
Ed
On Mon, Mar 6, 2023 at 3:02 PM Dan Cross <crossd@gmail.com> wrote:
> On Mon, Mar 6, 2023 at 5:02 AM Ed Bradford <egbegb2@gmail.com> wrote:
> >[snip]
> > I would like to extend my program to
> > any date format. That would require
> > a much bigger RE. I have been led to
> > believe that a 50Kbyte or 500Kbyte
> > RE works just as well (if not
> > as fast) as a 100 byte RE. I think
> > with parentheses and
> > pipe-symbols suitably used,
> > one could match
> >
> > Monday, March 6, 2023
> > 2023-03-06
> > Mar 6, 2023
> > or
> > ...
>
> This reminds me of something that I wanted to bring up.
>
> Perhaps one _could_ define a sufficiently rich regular expression that
> one could match a number of date formats. However, I submit that one
> _should not_. REs may be sufficiently powerful, but in all likelihood
> what you'll end up with is an unreadable mess; it's like people who
> abuse `sed` or whatever to execute complex, general purpose programs:
> yeah, it's a clever hack, but that doesn't mean you should do it.
>
> Pick the right tool for the job. REs are a powerful tool, but they're
> not the right tool for _every_ job, and I'd argue that once you hit a
> threshold of complexity that'll be mostly self-evident, it's time to
> move on to something else.
>
> As for large vs small REs.... When we start talking about differences
> of orders of magnitude in size, we start talking about real
> performance implications; in general an NDFA simulation of a regular
> expression will have on the order of the length of the RE in states,
> so when the length of the RE is half a million symbols, that's
> half-a-million states, which practically speaking is a pretty big
> number, even though it's bounded is still a pretty big number, and
> even on modern CPUs.
>
> I wouldn't want to poke that bear.
>
> - Dan C.
>
--
Advice is judged by results, not by intentions.
Cicero
[-- Attachment #2: Type: text/html, Size: 4427 bytes --]
^ permalink raw reply [flat|nested] 46+ messages in thread
* [COFF] Requesting thoughts on extended regular expressions in grep.
2023-03-07 4:01 ` Ed Bradford
@ 2023-03-07 11:39 ` Ralph Corderoy
2023-03-07 18:31 ` [COFF] " Grant Taylor via COFF
2023-03-08 11:22 ` Ed Bradford
2023-03-07 16:14 ` Dan Cross
1 sibling, 2 replies; 46+ messages in thread
From: Ralph Corderoy @ 2023-03-07 11:39 UTC (permalink / raw)
To: coff
Hi Ed,
> I have made an attempt to make my RE stuff readable and supportable.
Readable to you, which is fine because you're the prime future reader.
But it's less readable than the regexp to those that know and read them
because of the indirection introduced by the variables. You've created
your own little language of CAPITALS rather than the lingua franca of
regexps. :-)
> Machine language was unreadable and then along came assembly language.
> Assembly language was unreadable, then came higher level languages.
Each time the original language was readable because practitioners had
to read and write it. When its replacement came along, the old skill
was no longer learnt and the language became ‘unreadable’.
> So far, I can do that for this RE program that works for small files,
> large files, binary files and text files for exactly one pattern:
> YYYY[-MM-DD]
> I constructed this RE with code like this:
> # ymdt is YYYY-MM-DD RE in text.
> # looking only for 1900s and 2000s years and no later than today.
> _YYYY = "(19\d\d|20[01]\d|202" + "[0-" + lastYearRE) + "]" + "){1}"
‘{1}’ is redundant.
> # months
> _MM = "(0[1-9]|1[012])"
> # days
> _DD = "(0[1-9]|[12]\d|3[01])"
> ymdt = _YYYY + '[' + _INTERNALSEP +
> _MM +
> _INTERNALSEP +
> ']'{0,1)
I think we're missing something as the ‘'['’ is starting a character
class which is odd for wrapping the month and the ‘{0,1)’ doesn't have
matching brackets and is outside the string.
BTW, ‘{0,1}’ is more readable to those who know regexps as ‘?’.
> For the whole file, RE I used
> ymdthf = _FRSEP + ymdt + _BASEP
> where FRSEP is front separator which includes
> a bunch of possible separators, excluding numbers and letters, or-ed
> with the up arrow "beginning of line" RE mark.
It sounds like you're wanting a word boundary; something provided by
regexps. In Python, it's ‘\b’.
>>> re.search(r'\bfoo\b', 'endfoo foostart foo ends'),
(<re.Match object; span=(16, 19), match='foo'>,)
Are you aware of the /x modifier to a regexp which ignores internal
whitespace, including linefeeds? This allows a large regexp to be split
over lines. There's a comment syntax too. See
https://docs.python.org/3/library/re.html#re.X
GNU grep isn't too shabby at looking through binary files. I can't use
/x with grep so in a bash script, I'd do it manually. \< and \> match
the start and end of a word, a bit like Python's \b.
re='
.?\<
(19[0-9][0-9]|20[01][0-9]|202[0-3])
(
([-:._])
(0[1-9]|1[0-2])
\3
(0[1-9]|[12][0-9]|3[01])
)?
\>.?
'
re=${re//$'\n'/}
re=${re// /}
printf '%s\n' 2001-04-01,1999_12_31 1944.03.01,1914! 2000-01.01 >big-binary-file
LC_ALL=C grep -Eboa "$re" big-binary-file | sed -n l
which gives
0:2001-04-01,$
11:1999_12_31$
22:1944.03.01,$
33:1914!$
39:2000-$
showing:
- the byte offset within the file of each match,
- along with the any before and after byte if it's not a \n and not
already matched, just to show the word-boundary at work,
- with any non-printables escaped into octal by sed.
> I thought I was on the COFF mailing list.
I'm sending this to just the list.
> I received this email by direct mail to from Larry.
Perhaps your account on the list is configured to not send you an email
if it sees your address in the header's fields.
--
Cheers, Ralph.
^ permalink raw reply [flat|nested] 46+ messages in thread
* [COFF] Re: Requesting thoughts on extended regular expressions in grep.
2023-03-07 4:01 ` Ed Bradford
2023-03-07 11:39 ` [COFF] " Ralph Corderoy
@ 2023-03-07 16:14 ` Dan Cross
2023-03-07 17:34 ` [COFF] " Ralph Corderoy
1 sibling, 1 reply; 46+ messages in thread
From: Dan Cross @ 2023-03-07 16:14 UTC (permalink / raw)
To: Ed Bradford; +Cc: Grant Taylor, COFF
On Mon, Mar 6, 2023 at 11:01 PM Ed Bradford <egbegb2@gmail.com> wrote:
>[snip]
> I think it is possible to make a 50K RE that is understandable. However, it requires
> a lot of 'splainin' throughout the code. I'm naive though; I will eventually discover
> a lack of truth in that belief, if such exists.
Actually, I believe you. I'm sure that with enough effort, it _is_
possible to make a 50K RE that is understandable to mere mortals. But
it begs the question: why bother? The answer to that question, in my
mind, shows the difference between a clever programmer and a pragmatic
engineer. I submit that it's time to reach for another tool well
before you get to an RE that big, and if one is still considering such
a thing, one must really ask what properties of REs and the problem at
hand one thinks lend itself to that as the solution.
>[snip]
> It sounds to me like an "optimizer" is needed. There is alreay a compiler
> that uses FA's.
I'm not sure what you're referring to here, though you were replying
to me. There are a couple of different threads floating around:
1. Writing really big regular expressions: this is probably a bad
idea. Don't do it (see below).
2. Writing a recognizer for dates. Yeah, the small REs you have for
that are fine. If you want to extend those to arbitrary date formats,
I think you'll find it starts getting ugly.
3. Optimizing regular expressions. You're still bound by the known
theoretical properties of finite automata here.
> Is someone else going to create a program
> to look for dates without using regular expressions?
Many people have already done so. :-)
> Today, I write small-sized RE's. If I write a giant RE, there is nothing preventing
> the owner of RE world to change how they are used. For instance. Compile your RE
> and a subroutine/function is produced that performs the RE search.
I'm not sure I understand what you mean.
The theory here is well-understood: we know recognizers for regular
languages can be built from DFAs, that run in time linear in the size
of their inputs, but we also know that constructing such a DFA can be
exponential in space and time, and thus impractical for many REs.
We know that NDFA simulators can be built in time and space linear in
the length of the RE, but that the resulting recognizers will be
superlinear at runtime, proportional to the product of the length of
input, number of states, and number edges between states in the state
transition graph. For a very large regular expression, that's going to
be a pretty big number, and even on modern CPUs won't be particularly
fast. Compilation to native code won't really help you.
There is no "owner of RE world" that can change that. If you can find
some way to do so, I think that would qualify as a major breakthrough
in computer science.
> RE is a language, not necessarily an implementation.
> At least that is my understanding.
Regular expressions describe regular languages, but as I mentioned
above, the theory gives the currently understood bounds for their
performance characteristics. It's kinda like the speed of light in
this regard; we can't really make it go faster.
- Dan C.
^ permalink raw reply [flat|nested] 46+ messages in thread
* [COFF] Requesting thoughts on extended regular expressions in grep.
2023-03-07 16:14 ` Dan Cross
@ 2023-03-07 17:34 ` Ralph Corderoy
2023-03-07 18:33 ` [COFF] " Dan Cross
0 siblings, 1 reply; 46+ messages in thread
From: Ralph Corderoy @ 2023-03-07 17:34 UTC (permalink / raw)
To: coff
Hi Dan,
> I'm sure that with enough effort, it _is_ possible to make a 50K RE
> that is understandable to mere mortals. But it begs the question: why
> bother?
It could be the quickest way to express the intent.
> The answer to that question, in my mind, shows the difference between
> a clever programmer and a pragmatic engineer.
I think those two can overlap. :-)
> I submit that it's time to reach for another tool well before you get
> to an RE that big
Why, if the grammar is type three in Chomsky's hierarchy, i.e. a regular
grammar? I think sticking with code aimed at regular grammars, or more
likely regexps, will do better than, say, a parser generator for a
type-two context-free grammar.
As well as the lex(1) family, there's Ragel as another example.
http://www.colm.net/open-source/ragel/
> 3. Optimizing regular expressions. You're still bound by the known
> theoretical properties of finite automata here.
Well, we're back to the RE v. regexp again. /^[0-9]+\.jpeg$/ is matched
by some engines by first checking the last five bytes are ‘.jpeg’.
$ debugperl -Dr -e \
> '"123546789012354678901235467890123546789012.jpg" =~ /^[0-9]+\.jpeg$/'
...
Matching REx "^[0-9]+\.jpeg$" against "123546789012354678901235467890123546789012.jpg"
Intuit: trying to determine minimum start position...
doing 'check' fbm scan, [1..46] gave -1
Did not find floating substr ".jpeg"$...
Match rejected by optimizer
Freeing REx: "^[0-9]+\.jpeg$"
$
Boyer-Moore string searching can be used. Common-subregexp-elimination
can spot repetitive fragment of regexp and factor them into a single set
of states along with pairing the route into them with the appropriate
route out.
The more regexp engines are optimised, the more benefit to the
programmer from sticking to a regexp rather than, say, ad hoc parsing.
The theory of REs is interesting and important, but regexps deviate from
it ever more.
--
Cheers, Ralph.
^ permalink raw reply [flat|nested] 46+ messages in thread
* [COFF] Re: Requesting thoughts on extended regular expressions in grep.
2023-03-07 11:39 ` [COFF] " Ralph Corderoy
@ 2023-03-07 18:31 ` Grant Taylor via COFF
2023-03-08 11:22 ` Ed Bradford
1 sibling, 0 replies; 46+ messages in thread
From: Grant Taylor via COFF @ 2023-03-07 18:31 UTC (permalink / raw)
To: coff
[-- Attachment #1: Type: text/plain, Size: 4934 bytes --]
On 3/7/23 4:39 AM, Ralph Corderoy wrote:
> Readable to you, which is fine because you're the prime future
> reader. But it's less readable than the regexp to those that know
> and read them because of the indirection introduced by the variables.
> You've created your own little language of CAPITALS rather than the
> lingua franca of regexps. :-)
I want to agree, but then I run into things like this:
^\w{3} [ :[:digit:]]{11} [._[:alnum:]-]+
postfix(/smtps)?/smtpd\[[[:digit:]]+\]: disconnect from
[._[:alnum:]-]+\[[.:[:xdigit:]]+\]( helo=[[:digit:]]+(/[[:digit:]]+)?)?(
ehlo=[[:digit:]]+(/[[:digit:]]+)?)?(
starttls=[[:digit:]]+(/[[:digit:]]+)?)?(
auth=[[:digit:]]+(/[[:digit:]]+)?)?(
mail=[[:digit:]]+(/[[:digit:]]+)?)?(
rcpt=[[:digit:]]+(/[[:digit:]]+)?)?(
data=[[:digit:]]+(/[[:digit:]]+)?)?(
bdat=[[:digit:]]+(/[[:digit:]]+)?)?(
rset=[[:digit:]]+(/[[:digit:]]+)?)?(
noop=[[:digit:]]+(/[[:digit:]]+)?)?(
quit=[[:digit:]]+(/[[:digit:]]+)?)?(
unknown=[[:digit:]]+(/[[:digit:]]+)?)?(
commands=[[:digit:]]+(/[[:digit:]]+)?)?$
Which is produced by this m4:
define(`DAEMONPID', `$1\[DIGITS\]:')dnl
define(`DATE', `\w{3} [ :[:digit:]]{11}')dnl
define(`DIGIT', `[[:digit:]]')dnl
define(`DIGITS', `DIGIT+')dnl
define(`HOST', `[._[:alnum:]-]+')dnl
define(`HOSTIP', `HOST\[IP\]')dnl
define(`IP', `[.:[:xdigit:]]+')dnl
define(`VERB', `( $1=DIGITS`'(/DIGITS)?)?')dnl
^DATE HOST DAEMONPID(`postfix(/smtps)?/smtpd') disconnect from
HOSTIP`'VERB(`helo')VERB(`ehlo')VERB(`starttls')VERB(`auth')VERB(`mail')VERB(`rcpt')VERB(`data')VERB(`bdat')VERB(`rset')VERB(`noop')VERB(`quit')VERB(`unknown')VERB(`commands')$
I only consider myself to be an /adequate/ m4 user. Though I've done
some things that are arguably creating new languages.
I personally find the generated regular expression to be onerous to read
and understand, much less modify. I would be highly dependent on my
editor's (vim's) parenthesis / square bracket matching (%) capability
and / or would need to explode the RE into multiple components on
multiple lines to have a hope of accurately understanding or modifying it.
Conversely I think that the m4 is /largely/ find and replace with a
little syntactic sugar around the definitions.
I also think that anyone that does understand regular expressions and
the concept of find & replace is likely to be able to both recognize
patterns -- as in "VERB(...)" corresponds to "(
$1=DIGITS`'(/DIGITS)?)?", that "DIGITS" corresponds to "DIGIT+", and
that "DIGIT" corresponds to "[[:digit:]]".
There seems to be a point between simple REs w/o any supporting
constructor and complex REs with supporting constructor where I think it
is better to have the constructors. Especially when duplication comes
into play.
If nothing else, the constructors are likely to reduce one-off typo
errors. The typo will either be everywhere the constructor was used, or
similarly be fixed everywhere at the same time. Conversely, finding an
unmatched parenthesis or square bracket in the RE above will be annoying
at best if not likely to be more daunting.
> Each time the original language was readable because practitioners
> had to read and write it. When its replacement came along, the old
> skill was no longer learnt and the language became ‘unreadable’.
I feel like there is an analogy between machine code and assembly
language as well as assembly language and higher level languages.
My understanding is that the computer industry has vastly agreed that
the higher level language is easier to understand and maintain.
> ‘{1}’ is redundant.
That may very well be. But what will be more maintainable / easier to
correct in the future; adding `{2}` when necessary or changing the value
of `1` to `2`?
I think this is an example of tradeoff of not strictly required to make
something more maintainable down the road. Sort of like fleet vehicles
vs non-fleet vehicles.
> BTW, ‘{0,1}’ is more readable to those who know regexps as ‘?’.
I think this is another example of the maintainability.
> I'm sending this to just the list.
I'm also replying to only the COFF mailing list.
> Perhaps your account on the list is configured to not send you an
> email if it sees your address in the header's fields.
There is a reasonable chance that the COFF mailing list and / or your
account therein is configured to minimize duplicates meaning the COFF
mailing list won't send you a copy if it sees your subscribed address as
receiving a copy directly.
I personally always prefer the mailing list copy and shun the direct
copies. I think that the copy from the mailing list keeps the
discussion on the mailing list and avoids accidental replies bypassing
the mailing list.
--
Grant. . . .
unix || die
[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4017 bytes --]
^ permalink raw reply [flat|nested] 46+ messages in thread
* [COFF] Re: Requesting thoughts on extended regular expressions in grep.
2023-03-07 17:34 ` [COFF] " Ralph Corderoy
@ 2023-03-07 18:33 ` Dan Cross
0 siblings, 0 replies; 46+ messages in thread
From: Dan Cross @ 2023-03-07 18:33 UTC (permalink / raw)
To: Ralph Corderoy; +Cc: coff
On Tue, Mar 7, 2023 at 12:34 PM Ralph Corderoy <ralph@inputplus.co.uk> wrote:
> > I'm sure that with enough effort, it _is_ possible to make a 50K RE
> > that is understandable to mere mortals. But it begs the question: why
> > bother?
>
> It could be the quickest way to express the intent.
Ok, I challenge you to find me anything for which the quickest way to
express the intent is a 50 *thousand* symbol regular expression. :-)
> > The answer to that question, in my mind, shows the difference between
> > a clever programmer and a pragmatic engineer.
>
> I think those two can overlap. :-)
Indeed they can. But I have grave doubts that this is a good example
of such overlap.
> > I submit that it's time to reach for another tool well before you get
> > to an RE that big
>
> Why, if the grammar is type three in Chomsky's hierarchy, i.e. a regular
> grammar? I think sticking with code aimed at regular grammars, or more
> likely regexps, will do better than, say, a parser generator for a
> type-two context-free grammar.
Is there an extant, non-theoretical, example of such a grammar?
> As well as the lex(1) family, there's Ragel as another example.
> http://www.colm.net/open-source/ragel/
This is moving the goal posts more than a bit. I'm suggesting that a
50k-symbol RE is unlikely to be the best solution to any reasonable
problem. A state-machine generator, even one with 50k statements, is
not a 50k RE.
> > 3. Optimizing regular expressions. You're still bound by the known
> > theoretical properties of finite automata here.
>
> Well, we're back to the RE v. regexp again. /^[0-9]+\.jpeg$/ is matched
> by some engines by first checking the last five bytes are ‘.jpeg’.
...in general, in order to find the end, won't _something_ have to
traverse the entire input?
(Note that I said, "in general". Allusions to mmap'ed files or seeking
to the end of a file are not general, since they don't apply well to
important categories of input sources, such as pipes or network
connections.)
> $ debugperl -Dr -e \
> > '"123546789012354678901235467890123546789012.jpg" =~ /^[0-9]+\.jpeg$/'
> ...
> Matching REx "^[0-9]+\.jpeg$" against "123546789012354678901235467890123546789012.jpg"
> Intuit: trying to determine minimum start position...
> doing 'check' fbm scan, [1..46] gave -1
> Did not find floating substr ".jpeg"$...
> Match rejected by optimizer
> Freeing REx: "^[0-9]+\.jpeg$"
> $
>
> Boyer-Moore string searching can be used. Common-subregexp-elimination
> can spot repetitive fragment of regexp and factor them into a single set
> of states along with pairing the route into them with the appropriate
> route out.
Well, that's what big-O notation accounts for.
I'm afraid none of this really changes the time bounds, however, when
applied in general.
> The more regexp engines are optimised, the more benefit to the
> programmer from sticking to a regexp rather than, say, ad hoc parsing.
This is comparing apples and oranges. There may be all sorts of
heuristics that we can apply to specific regular expressions to prune
the search space, and that's great. But by their very nature,
heuristics are not always generally applicable.
As an analogy, we know that we cannot solve _the_ halting problem, but
we also know that we can solve _many_ halting problem_s_. For example,
a compiler can recognize that any of, `for(;;);` or `while(1);` or
`loop {}` do not halt, and so on, ad nauseum, but even if some oracle
can recognize arbitrarily many such halting problems, we still haven't
solved the general problem.
> The theory of REs is interesting and important, but regexps deviate from
> it ever more.
Yup. My post should not be construed as suggesting that regexps are
not useful, or that they should not be a part of a programmer's
toolkit. My post _should_ be construed as a suggestion that they are
not always the best solution, and a part of being an engineer is
finding that dividing line.
- Dan C.
^ permalink raw reply [flat|nested] 46+ messages in thread
* [COFF] Re: Reading PDFs on a mobile. (Was: Requesting thoughts on extended regular expressions in grep.)
2023-03-04 10:15 ` [COFF] Reading PDFs on a mobile. (Was: Requesting thoughts on extended regular expressions in grep.) Ralph Corderoy
@ 2023-03-07 21:49 ` Tomasz Rola
2023-03-07 22:46 ` Tomasz Rola
2023-06-20 16:02 ` Michael Parson
1 sibling, 1 reply; 46+ messages in thread
From: Tomasz Rola @ 2023-03-07 21:49 UTC (permalink / raw)
To: coff
On Sat, Mar 04, 2023 at 10:15:33AM +0000, Ralph Corderoy wrote:
> Hi,
>
> Grant wrote:
> > Even inventorying and keeping track of the books can be time
> > consuming. -- Thankfully I took some time to do exactly that and
> > have access to that information on the super computer in my pocket.
>
> I seek recommendations for an Android app to comfortably read PDFs on a
> mobile phone's screen. They were intended to be printed as a book. In
> particular, once I've zoomed and panned to get the interesting part of a
> page as large as possible, swiping between pages should persist that
> view. An extra point for allowing odd and even pages to use different
> panning.
My own recommendation for this is to get a dedicated ebook reader. It
will feel a bit clumsy to have both a cretinphone and another thing
with you, but at least the thing is doing the job. At least, mine
keeps cropping across pages. Also, the e-ink/epaper display of ebook
reader is not supposed to screw your eyes and/or circadian rhythms
(not that I know anything specific, but I find it very strange that
people shine blue light into their eyes for extended periods of time
and do not even quietly protest - well, perhaps it is akin to what
goes between human and a dog, they become alike to each other, now,
when a human has cretinphone...).
Or, if it just one pdf to read, then you should be fine reading it on
bigger screen.
HTH
--
Regards,
Tomasz Rola
--
** A C programmer asked whether computer had Buddha's nature. **
** As the answer, master did "rm -rif" on the programmer's home **
** directory. And then the C programmer became enlightened... **
** **
** Tomasz Rola mailto:tomasz_rola@bigfoot.com **
^ permalink raw reply [flat|nested] 46+ messages in thread
* [COFF] Re: Reading PDFs on a mobile. (Was: Requesting thoughts on extended regular expressions in grep.)
2023-03-07 21:49 ` [COFF] " Tomasz Rola
@ 2023-03-07 22:46 ` Tomasz Rola
0 siblings, 0 replies; 46+ messages in thread
From: Tomasz Rola @ 2023-03-07 22:46 UTC (permalink / raw)
To: coff
On Tue, Mar 07, 2023 at 10:49:14PM +0100, Tomasz Rola wrote:
[...]
> people shine blue light into their eyes for extended periods of time
> and do not even quietly protest - well, perhaps it is akin to what
> goes between human and a dog, they become alike to each other, now,
> when a human has cretinphone...).
To answer unasked question, I own a cretinphone too :-). And few
dumbs. Together, they sum up to something like cretinphone on
steroids.
--
Regards,
Tomasz Rola
--
** A C programmer asked whether computer had Buddha's nature. **
** As the answer, master did "rm -rif" on the programmer's home **
** directory. And then the C programmer became enlightened... **
** **
** Tomasz Rola mailto:tomasz_rola@bigfoot.com **
^ permalink raw reply [flat|nested] 46+ messages in thread
* [COFF] Re: Requesting thoughts on extended regular expressions in grep.
2023-03-07 11:39 ` [COFF] " Ralph Corderoy
2023-03-07 18:31 ` [COFF] " Grant Taylor via COFF
@ 2023-03-08 11:22 ` Ed Bradford
1 sibling, 0 replies; 46+ messages in thread
From: Ed Bradford @ 2023-03-08 11:22 UTC (permalink / raw)
To: Ralph Corderoy; +Cc: coff
[-- Attachment #1: Type: text/plain, Size: 10419 bytes --]
Thank you for the very useful comments. However,
I disagree with you about the RE language. While
I agree all RE experts don't need that, when I was
hiring and gave some software to a new hire (whether
an experienced programmer or a recent college grad)
simply handing over huge RE's to my new hire was
a daunting task to that person. I wrote that stuff
that way to help remind me and anyone who might
use the python program.
I don't claim success. It does help me.
When you say '{1}' is redundant, I think I did that
to avoid any possibility of conflicts with the
next string that is
concatentated to the *Y_* (e.g. '*' or '+' or '{4,7}').
I am embarrassed I did not communicate that
in the code. I had to think about it for a couple of hours
before I recalled the "why". I will fix that.
(it would be difficult to discuss
this RE if I had to write
"(19\d\d|20[01]\d|202" + "[0-" + lastYearRE + "]" + ")
rather than just *Y_*).
My initial thoughts
on naming were I wanted the definition to be defined
in exactly one place in the software.
Python and the BTL folks told me to never
use a constant in code. Always name it.
Hence, I gave it a name. Each name might
be used in multiple places. They might be imported.
You are correct, the expression is unbalanced. I tried
to remove the text2bytes(lastYearRE*)* call so the expression in
this email was all text. I failed to remove the trailing *)* when
I removed the call to text2bytes(). My hasty transcriptions
might have produced similar errors in my email.
Recall, my focus was on any file of any size.
I'm on Windows 10 and an m1 MacBook.
Python works on both. I don't have
a Linux machine or enough desktop space to
host one. I'm also mildly fed-up with
virtual machines.
Friedl taught me one thing. Most
RE implementations are different. I'm trying
to write a program that I could give
to anyone and could reliably find a date (an RE) in
any file. YYYY, MM, DD, HR, MI, SE, TH are words
my user could use in the command line or in
an options dialog. LAT and LON might also be
possibilities. CST, EST, MST, PST, ... also.
A 500 gigabyte archive or directory/folder
of pictures and movies would be
a great test target.
I very much appreciate your comments. If this
discussion is boring to others, I would be happy
to take it to emails.
I like your program. My experience
with RE, grep, python, and sed suggests that
anything but gnu grep and sed might not work due to the
different implementations.
I've been out of the Unix software business
for 30 years after starting work at BTL in the 1970s
and working on Version 6. I didn't know "printf" was now
built into bash! That was a surprise. It's an incremental
improvement, but doesn't compare with f-strings in python.
*The interactive interpreter for python should have*
*a "bash" mode?!*
Does grep use a memory mapped file for its search, thereby
avoiding all buffering boundaries? That too, would
be new information to me. The additional complexity
of dealing with buffering is more than annoying.
Do you have any thoughts on how to verify
a program that uses RE's. I've given no thought
until now. My first thought for dates would be
to write a separate module that simply searched
through the file looking for 4 numbers in a row
without using RE's, recording the offsets and 16 characters
after and 1 character before in a python list of (offset,str)
of tuples, ddddList, and using *dddd**List*
as a proxy for the entire file. I could then
aim my RE's at *ddddList*. *[A list of tuples in python*
*is wonderful! !]* It seems to me '*' and '+' and {x,y} are the performance
hogs in RE's. My RE's avoid them. One pass, I think, should
suffice. What do you think? I haven't "archived" my 350 GB
of pictures and movies, but one pass over all files therein
ought to suffice, right? Two different programs that use different
algorithms should be pretty good proof of correctness wouldn't
you think?
My RE's have no stars or pluses. If there is a mismatch before
a match, give up and move on.
On my Windows 10 machine, I have cygwin.
Microsoft says my CPU doesn't have a TPM and
the specific Intel Core I7 on my system is not
supported so Windows 11 is not happening.
Microsoft is DOS personified.
(An unkind editorial remark about the low
quality of software coming from Microsoft.)
Anyway, I thank you again for your patience with me
and your observations. I value your views and the
other views I've seen here on coff@tuhs.org.
I welcome all input to my education and will share
all I have done so far with anyone who wants to
collaborate, test, or is just curious.
GOAL: run python program from an at-cost thumb drive that:
reaps all media files from a user specified
directory/folder tree and
Adds files to the thumb drive.
*Adds files* means
Original file system is untouched
Adds only unique files (hash codes are unique)
Creates on the thumb drive a relative directory
wherein the original file was found
Prepends a "YYYY-MM-DD-" string to the filename
if one can be found (EXIF is great shortcut).
Copies
srcroot/relative_path/oldfilename
to
thumbdrive/relative_path/YYYY-MM-DD-oldfilename
or
thumbdrive/relative_path/0000-oldfilename.
Can also incrementally add new files by just
scanning anywhere in any other computer
file system or any other computer.
Must work on Mac, Windows, and Linux
What I have is a working prototype. It works
on Mac and Windows. It doesn't do the
date thing very well, and there are other shortcomings.
I have delivered exactly one Christmas present to my favorite person
in the world - a 400 GB SSD drive with all our pictures and media
we have ever taken. The next things are to *add *more media
and *re-unique-ify* (check) what is already present on the SSD drive
and *improve the proper choice of "YYYY-MM-DD-" prefix* to
filenames.
I am retired and this is fun.
I'm too old to want to get rich.
Ed Bradford
Pflugerville, TX
egbegb2@gmail.com
On Tue, Mar 7, 2023 at 5:40 AM Ralph Corderoy <ralph@inputplus.co.uk> wrote:
> Hi Ed,
>
> > I have made an attempt to make my RE stuff readable and supportable.
>
> Readable to you, which is fine because you're the prime future reader.
> But it's less readable than the regexp to those that know and read them
> because of the indirection introduced by the variables. You've created
> your own little language of CAPITALS rather than the lingua franca of
> regexps. :-)
>
> > Machine language was unreadable and then along came assembly language.
> > Assembly language was unreadable, then came higher level languages.
>
> Each time the original language was readable because practitioners had
> to read and write it. When its replacement came along, the old skill
> was no longer learnt and the language became ‘unreadable’.
>
> > So far, I can do that for this RE program that works for small files,
> > large files, binary files and text files for exactly one pattern:
> > YYYY[-MM-DD]
> > I constructed this RE with code like this:
> > # ymdt is YYYY-MM-DD RE in text.
> > # looking only for 1900s and 2000s years and no later than today.
> > _YYYY = "(19\d\d|20[01]\d|202" + "[0-" + lastYearRE) + "]" + "){1}"
>
> ‘{1}’ is redundant.
>
> > # months
> > _MM = "(0[1-9]|1[012])"
> > # days
> > _DD = "(0[1-9]|[12]\d|3[01])"
> > ymdt = _YYYY + '[' + _INTERNALSEP +
> > _MM +
> > _INTERNALSEP +
> > ']'{0,1)
>
> I think we're missing something as the ‘'['’ is starting a character
> class which is odd for wrapping the month and the ‘{0,1)’ doesn't have
> matching brackets and is outside the string.
>
> BTW, ‘{0,1}’ is more readable to those who know regexps as ‘?’.
>
> > For the whole file, RE I used
> > ymdthf = _FRSEP + ymdt + _BASEP
> > where FRSEP is front separator which includes
> > a bunch of possible separators, excluding numbers and letters, or-ed
> > with the up arrow "beginning of line" RE mark.
>
> It sounds like you're wanting a word boundary; something provided by
> regexps. In Python, it's ‘\b’.
>
> >>> re.search(r'\bfoo\b', 'endfoo foostart foo ends'),
> (<re.Match object; span=(16, 19), match='foo'>,)
>
> Are you aware of the /x modifier to a regexp which ignores internal
> whitespace, including linefeeds? This allows a large regexp to be split
> over lines. There's a comment syntax too. See
> https://docs.python.org/3/library/re.html#re.X
>
> GNU grep isn't too shabby at looking through binary files. I can't use
> /x with grep so in a bash script, I'd do it manually. \< and \> match
> the start and end of a word, a bit like Python's \b.
>
> re='
> .?\<
> (19[0-9][0-9]|20[01][0-9]|202[0-3])
> (
> ([-:._])
> (0[1-9]|1[0-2])
> \3
> (0[1-9]|[12][0-9]|3[01])
> )?
> \>.?
> '
> re=${re//$'\n'/}
> re=${re// /}
>
> printf '%s\n' 2001-04-01,1999_12_31 1944.03.01,1914! 2000-01.01
> >big-binary-file
> LC_ALL=C grep -Eboa "$re" big-binary-file | sed -n l
>
> which gives
>
> 0:2001-04-01,$
> 11:1999_12_31$
> 22:1944.03.01,$
> 33:1914!$
> 39:2000-$
>
> showing:
>
> - the byte offset within the file of each match,
> - along with the any before and after byte if it's not a \n and not
> already matched, just to show the word-boundary at work,
> - with any non-printables escaped into octal by sed.
>
> > I thought I was on the COFF mailing list.
>
> I'm sending this to just the list.
>
> > I received this email by direct mail to from Larry.
>
> Perhaps your account on the list is configured to not send you an email
> if it sees your address in the header's fields.
>
> --
> Cheers, Ralph.
>
--
Advice is judged by results, not by intentions.
Cicero
[-- Attachment #2: Type: text/html, Size: 25039 bytes --]
^ permalink raw reply [flat|nested] 46+ messages in thread
* [COFF] Re: Reading PDFs on a mobile. (Was: Requesting thoughts on extended regular expressions in grep.)
2023-03-04 10:15 ` [COFF] Reading PDFs on a mobile. (Was: Requesting thoughts on extended regular expressions in grep.) Ralph Corderoy
2023-03-07 21:49 ` [COFF] " Tomasz Rola
@ 2023-06-20 16:02 ` Michael Parson
2023-06-20 21:26 ` Tomasz Rola
1 sibling, 1 reply; 46+ messages in thread
From: Michael Parson @ 2023-06-20 16:02 UTC (permalink / raw)
To: coff
On Sat, 4 Mar 2023, Ralph Corderoy wrote:
> Date: Sat, 4 Mar 2023 04:15:33
> From: Ralph Corderoy <ralph@inputplus.co.uk>
> To: coff@tuhs.org
> Subject: [COFF] Reading PDFs on a mobile. (Was: Requesting thoughts on
> extended regular expressions in grep.)
>
> Hi,
>
> Grant wrote:
>> Even inventorying and keeping track of the books can be time
>> consuming. -- Thankfully I took some time to do exactly that and have
>> access to that information on the super computer in my pocket.
>
> I seek recommendations for an Android app to comfortably read PDFs on
> a mobile phone's screen. They were intended to be printed as a book.
> In particular, once I've zoomed and panned to get the interesting part
> of a page as large as possible, swiping between pages should persist
> that view. An extra point for allowing odd and even pages to use
> different panning.
Sorry for responding to an old thread got behind on my list-mail
reading, but I wanted to share my $.02.
Someone else mentioned an e-book reader app, and I second that,
mostly...Moon+ Reader on Android is the e-book reader I've been using
for a while and it does a good job with standard e-book formats
as well as PDF files, IF the PDF is a PDF of formatted text. It
even has a mode where it will do a pretty decent job of on-the-fly
converting/reformatting the text of the PDF to something that can
actually be read on a small (phone) screen. However, if the PDF is just
a bunch of 1 image per page wrapped in a PDF container, you're out of
luck and back to zoom/pan around the page.
For most of my digtal book reading these days, I use a Boox e-ink
reader. It runs Android, so, I can use the same e-book reader I used
on my phone. It can even sync where you're at in the book/document via
dropbox and you can move between multiple devices if needed.
If I want to mark-up the PDF, the built-in stuff on the Boox handles
that nicely. If I'm on my phone, I use an app called Xodo.
--
Michael Parson
Pflugerville, TX
^ permalink raw reply [flat|nested] 46+ messages in thread
* [COFF] Re: Reading PDFs on a mobile. (Was: Requesting thoughts on extended regular expressions in grep.)
2023-06-20 16:02 ` Michael Parson
@ 2023-06-20 21:26 ` Tomasz Rola
2023-06-22 15:45 ` Michael Parson
0 siblings, 1 reply; 46+ messages in thread
From: Tomasz Rola @ 2023-06-20 21:26 UTC (permalink / raw)
To: Computer Old Farts Followers
On Tue, Jun 20, 2023 at 11:02:33AM -0500, Michael Parson wrote:
> On Sat, 4 Mar 2023, Ralph Corderoy wrote:
>
> > Date: Sat, 4 Mar 2023 04:15:33
> > From: Ralph Corderoy <ralph@inputplus.co.uk>
> > To: coff@tuhs.org
> > Subject: [COFF] Reading PDFs on a mobile. (Was: Requesting thoughts on
> > extended regular expressions in grep.)
> >
> > Hi,
> >
> > Grant wrote:
> > > Even inventorying and keeping track of the books can be time
> > > consuming. -- Thankfully I took some time to do exactly that and have
> > > access to that information on the super computer in my pocket.
> >
> > I seek recommendations for an Android app to comfortably read PDFs on
> > a mobile phone's screen. They were intended to be printed as a book.
> > In particular, once I've zoomed and panned to get the interesting part
> > of a page as large as possible, swiping between pages should persist
> > that view. An extra point for allowing odd and even pages to use
> > different panning.
>
> Sorry for responding to an old thread got behind on my list-mail
> reading, but I wanted to share my $.02.
>
> Someone else mentioned an e-book reader app, and I second that,
> mostly...Moon+ Reader on Android is the e-book reader I've been using
> for a while and it does a good job with standard e-book formats
> as well as PDF files, IF the PDF is a PDF of formatted text. It
> even has a mode where it will do a pretty decent job of on-the-fly
> converting/reformatting the text of the PDF to something that can
> actually be read on a small (phone) screen. However, if the PDF is just
> a bunch of 1 image per page wrapped in a PDF container, you're out of
> luck and back to zoom/pan around the page.
>
> For most of my digtal book reading these days, I use a Boox e-ink
> reader. It runs Android, so, I can use the same e-book reader I used
> on my phone. It can even sync where you're at in the book/document via
> dropbox and you can move between multiple devices if needed.
>
> If I want to mark-up the PDF, the built-in stuff on the Boox handles
> that nicely. If I'm on my phone, I use an app called Xodo.
>
> --
> Michael Parson
> Pflugerville, TX
Hello Michael,
I am not challenging your choices (to each his/her own), but to add
some alternative, my own preferences go toward:
a. have sd card slot in a reader (I mean hardware with e-ink, not some
app on a phone). This means a card can be slipped into the box without
opening it. This means the box is not water-proof. However, I had a
look inside and I suspect it can still be water-prooved with duct
tape, if someone wants it so much.
b. so far I was rather happy with Linux custom made by manufacturer,
but not an Android - I am yet to try Android based ebook reader (but
maybe not too fast). Phones with A* are rather lousy at keeping their
batteries loaded, I wonder how eink devices fare - do they, too, need
to be recharged every day? My reader is recharged every 2-3 weeks,
when batt drops to about 70%, while I keep using it at least every
second day for few hours at a time.
I had once (many years go, when I was to buy my first reader) a dream
of browsing web pages with it. However, built in browser in non-A*
reader proved to be lousy, equally lousy to browser in A* phones that
I have tried. So, my current ereader was never connected to the net
because I see no point. Of course each model nowadays comes with
wi-fi, it just does not add anything useful so no need to even
configure it on home router etc. Nowadays, I would rather convert
whatever to pdf or epub and upload to the reader. Reading wikipedia
pages printed to pdf saved me plenty of grief, as opposed to trying
them in a (builtin) browser. I suspect elinks could look much better,
but trying this requires some free time (compiling/porting,
uploading).
As a side note, I have observed that some pdfs do not render too well
on my reader - it seems that they make this software "too new" to be
solid & fast nowadays. Same pdfs converted to djvu read like a dream,
however. Having more than few supported book formats is nice.
My reader also comes with BT, possibly meant to connect headphones but
perhaps usable for BT keyboard. Might be a thing to try in a future
(or not), I mention it to let others know there may be such an option
in case they care about it (I really do not, but I do not make those
things so what can I do...).
HTH
--
Regards,
Tomasz Rola
--
** A C programmer asked whether computer had Buddha's nature. **
** As the answer, master did "rm -rif" on the programmer's home **
** directory. And then the C programmer became enlightened... **
** **
** Tomasz Rola mailto:tomasz_rola@bigfoot.com **
^ permalink raw reply [flat|nested] 46+ messages in thread
* [COFF] Re: Reading PDFs on a mobile. (Was: Requesting thoughts on extended regular expressions in grep.)
2023-06-20 21:26 ` Tomasz Rola
@ 2023-06-22 15:45 ` Michael Parson
2023-07-10 9:08 ` [COFF] Re: Reader, paper, tablet, phone (was: Re: Reading PDFs on a mobile. (Was: Requesting thoughts on extended regular expressions in grep.)) Tomasz Rola
0 siblings, 1 reply; 46+ messages in thread
From: Michael Parson @ 2023-06-22 15:45 UTC (permalink / raw)
To: Computer Old Farts Followers
On Tue, 20 Jun 2023, Tomasz Rola wrote:
> On Tue, Jun 20, 2023 at 11:02:33AM -0500, Michael Parson wrote:
<snip>
>> Sorry for responding to an old thread got behind on my list-mail
>> reading, but I wanted to share my $.02.
>>
>> Someone else mentioned an e-book reader app, and I second that,
>> mostly...Moon+ Reader on Android is the e-book reader I've been using
>> for a while and it does a good job with standard e-book formats
>> as well as PDF files, IF the PDF is a PDF of formatted text. It
>> even has a mode where it will do a pretty decent job of on-the-fly
>> converting/reformatting the text of the PDF to something that can
>> actually be read on a small (phone) screen. However, if the PDF is just
>> a bunch of 1 image per page wrapped in a PDF container, you're out of
>> luck and back to zoom/pan around the page.
>>
>> For most of my digtal book reading these days, I use a Boox e-ink
>> reader. It runs Android, so, I can use the same e-book reader I used
>> on my phone. It can even sync where you're at in the book/document via
>> dropbox and you can move between multiple devices if needed.
>>
>> If I want to mark-up the PDF, the built-in stuff on the Boox handles
>> that nicely. If I'm on my phone, I use an app called Xodo.
>>
> Hello Michael,
>
> I am not challenging your choices (to each his/her own), but to add
> some alternative, my own preferences go toward:
>
> a. have sd card slot in a reader (I mean hardware with e-ink, not some
> app on a phone). This means a card can be slipped into the box without
> opening it. This means the box is not water-proof. However, I had a
> look inside and I suspect it can still be water-prooved with duct
> tape, if someone wants it so much.
(For the record, the "device" I'm discussing is my Boox Nova Air 7.8"
E Ink tablet, Amazon product B09FQ14Z6N.)
What are you wanting/needing an SD card for? Extra storage?
File-transfer? For what I use this device for, the built-in storage
(32GB) has been more than enough to hold the books & notes I need it to.
For file transfer, the USB Type-C port on it goes both ways, you can
connect it to a computer and move files that way, or you can get a
Type-C SD card reader. Plus there's wifi for network transfers.
I store my e-books in Calibre on my Linux desktop, which has a
web-server that the software on the device knows how to talk to, or I
can connect the device and my Linux box together over USB and Calibre
recognizes it as an e-reader device and can push/pull books via its
interface.
This device does not advertise any water-proofing at all.
> b. so far I was rather happy with Linux custom made by manufacturer,
> but not an Android - I am yet to try Android based ebook reader (but
> maybe not too fast). Phones with A* are rather lousy at keeping their
> batteries loaded, I wonder how eink devices fare - do they, too, need
> to be recharged every day? My reader is recharged every 2-3 weeks,
> when batt drops to about 70%, while I keep using it at least every
> second day for few hours at a time.
Out of the box, this device has very aggressive power-saving modes. If
the screen is left off for more than something like 10 minutes, it would
do a full shutdown, which means the next time you hit the power button,
it does a full boot. To be fair, it boots much faster than my phone,
about 15 seconds from hitting the power button to asking me to unlock
the screen. Default settings also turn off the wifi & bluetooth when
the screen is off. In this mode, yeah, it will last a few weeks.
I disabled the shutdown bits, I want to have that "instant on"
experience. Well, as instant as things get with E-Ink. From what I've
noticed, the biggest battery drain seems to be if I use the backlight or
not. I mostly don't, but it's there when other light isn't available.
With my usage patterns and device settings, I probably charge it up to
full once or twice a week, but that's plugging it in when it gets to
about 50% battery. If I left the wifi off, I could probably extend it
even more.
To be completely honest, I've not worried too much about how long things
keep a charge, as long as it can stay charged long enough for me to use
it for what I want to do with it. I have lots of USB chargers spread
around my house, which means that I can tend to have something charging
and still be nearby.
The exception to this is my watch. I have a Fossil Hybrid smart-watch.
It's an analog watch with an e-ink background that can show me
alerts/info from my phone. It only needs to be charged about once every
10 days or so. I don't want a watch that I'd have to recharge daily.
> I had once (many years go, when I was to buy my first reader) a
> dream of browsing web pages with it. However, built in browser in
> non-A* reader proved to be lousy, equally lousy to browser in A*
> phones that I have tried. So, my current ereader was never connected
> to the net because I see no point. Of course each model nowadays
> comes with wi-fi, it just does not add anything useful so no need
> to even configure it on home router etc. Nowadays, I would rather
> convert whatever to pdf or epub and upload to the reader. Reading
> wikipedia pages printed to pdf saved me plenty of grief, as opposed to
> trying them in a (builtin) browser. I suspect elinks could look much
> better, but trying this requires some free time (compiling/porting,
> uploading).
I don't do a lot of web browsing on the device, but when I do, it's
using firefox, not the vendor-provided browser. I'm all-in on the FF
ecosystem, have an FF account, all my systems have their browsers logged
into my FF account, I can send tabs between devices/desktops/laptops
easily, etc. I'm sure the same can be done with Chrome. Firefox on
Android also has a "Desktop site" slider that will re-render the page
like a desktop browser would instead of a mobile one. This is nice on
bigger screen Android devices.
That being said. The latest update to the firmware on this device has
added per-app E-ink refresh settings. So, for reading books in the
book-reader app, I have it set on one of the higher-quality, but slower
refresh modes. I read my books like books, page at a time, generally
little to no scrolling. This is harder to do when browsing the web, so,
I have Firefox set to the lower-quality but faster refresh mode. Yeah,
there is a little ghosting after the scrolling stops, but it's more like
the faint bleed-through you get when reading a newspaper.
> As a side note, I have observed that some pdfs do not render too well
> on my reader - it seems that they make this software "too new" to be
> solid & fast nowadays. Same pdfs converted to djvu read like a dream,
> however. Having more than few supported book formats is nice.
I mostly only deal with epub, mobi, and PDF. The blurb on Amazon says
that it handles "PDF(reflowable), PPT, EPUB, TXT, DJVU, HTML, RTF, FB2,
DOC, MOBI, CHM..."
> My reader also comes with BT, possibly meant to connect headphones but
> perhaps usable for BT keyboard. Might be a thing to try in a future
> (or not), I mention it to let others know there may be such an option
> in case they care about it (I really do not, but I do not make those
> things so what can I do...).
Yup, this device has BT as well, since being an Android device, you can
listen to MP3s, or stream from Pandora/Spotify/etc. I've never set that
up on here, I have other devices that can already do that. The only
BT I've hooked up is a keyboard, and that was mostly for a bit of fun.
I installed an ssh-client on there and spent a few hours logged into
my Linux box reading email. I still kinda want a laptop with an e-ink
display.
The other thing I use this device for is for hand-written notes and
sketches. Writing on the screen with the stylus feels a lot like
writing with a pencil on paper. I'm not an artist by any stretch, but
sometimes writing things out and using boxes, circles, helps sort things
out in my head, or sometimes I'll sketch things out while working on a
design I eventually want to 3d print. All stuff I'd historically done in
a paper notebook, but now I have similarly-sized electronic notepad
where I can erase mistakes, have layers to my drawings (like
photoshop/gimp/etc), zoom in/out, etc.
The notepad app is also linked to my DropBox account, so when I'm done
with whatever doodles/notes/sketches, I can then load them up on one
of my other systems as PDFs. I've even toyed with the idea of writing
letters out in long-hand in the notepad and then emailing the resulting
PDFs to people, since the practice of hand-written letters has gone by
the wayside, I thought this would be an entertaining way to bring that
back. :)
If what you're looking for is an E-reader that can also do the
note-taking (Or a note-taking device that also functions as an
e-reader), and not really much else, I've heard good things about the
Remarkable tablets. One of the things they advertise is it being a
distraction-free device, no alerts from the socials medias, emails, etc,
since the device flat doesn't do those things. I've never used one, but
I've seen people at my $DAYJOB say they really like theirs.
I'm not saying this is the perfect device, or that it will fill the
needs of anyone else. It doesn't solve every need I have either, which
is why I have multiple devices (phone, this e-reader tablet, personal
laptop, work laptop, personal desktop, scattered Raspberry Pis, etc).
--
Michael Parson
Pflugerville, TX
^ permalink raw reply [flat|nested] 46+ messages in thread
* [COFF] Re: Reader, paper, tablet, phone (was: Re: Reading PDFs on a mobile. (Was: Requesting thoughts on extended regular expressions in grep.))
2023-06-22 15:45 ` Michael Parson
@ 2023-07-10 9:08 ` Tomasz Rola
0 siblings, 0 replies; 46+ messages in thread
From: Tomasz Rola @ 2023-07-10 9:08 UTC (permalink / raw)
To: Computer Old Farts Followers
On Thu, Jun 22, 2023 at 10:45:43AM -0500, Michael Parson wrote:
> On Tue, 20 Jun 2023, Tomasz Rola wrote:
> > On Tue, Jun 20, 2023 at 11:02:33AM -0500, Michael Parson wrote:
>
> <snip>
>
[...]
> > a. have sd card slot in a reader (I mean hardware with e-ink, not some
> > app on a phone). This means a card can be slipped into the box without
> > opening it. This means the box is not water-proof. However, I had a
> > look inside and I suspect it can still be water-prooved with duct
> > tape, if someone wants it so much.
>
Hello again and sorry for the late answer. Busy, hardware glitches,
tired, etc.
> (For the record, the "device" I'm discussing is my Boox Nova Air 7.8" E Ink
> tablet, Amazon product B09FQ14Z6N.)
>
> What are you wanting/needing an SD card for? Extra storage?
> File-transfer? For what I use this device for, the built-in storage
> (32GB) has been more than enough to hold the books & notes I need it to.
There is no use pattern as such. It was just very convenient to have
one. When I bought my first reader, PocketBook 622 (or something like
it, a good decade ago), it only had 2GB of flash. Of this, about half
was already prefilled with free ebooks, free photos to see in some
app. I decided to keep all this, which left me with .LT. 1GB for my
use. If I wanted just epubs, this was more than enough to have few
thousands files. But, an atlas of human anatomy from Internet Archive
was ca. 8MB, Lisp 1.5 manual from bitsavers was another 9MB and if I
wanted some more, space would quickly ran out. An 8GB card solved the
problem.
Later on, a member of family fell in love with e-reader, too. But, in
her hands they died rather quickly. So we struck a deal - she will buy
me a new one and get mine, which after a year of use was still mostly
like new. I just replugged SD card and it was it. Now I have two
broken Pocketbooks 622 in a drawer, waiting for their chance - just
replace their displays.
Later on, I bought me an Inkpad (again from Pocketbook, probably I am
a fanboy). This one had about 7.8'' display, many more inky pixels and
not too big cpu, so was a bit sluggish. But I liked it. Alas, it
entered nirvana as a result of botched upgrade. If I am to believe
internet, it can be pulled back into this world with the help of
specially crafted serial cable and kermit.
So, later on I bought me a Pocketbook 633 (definitely a fanboy!). This
one can display 256 colors on outer LCD display and 16 levels of grey
on inner e-ink display, which combines into 4096 colors. I wanted to
give it a try, but if this specimmen dies, I will go back to normal
e-ink b/w. The colors are there, but nothing outstanding. Ok for office
kind of documents - pie charts, graphs etc. Space photographs,
ukiyo-e, n nono no. Besides, if I use color, it powers on the LCD
layer and consumes significantly more power than when it goes with
e-ink only. For some documents, color can be turned off, for icons on
main page, they show in color.
Each time I just replug the card. So next time, I guess I will be
looking for another reader with a slot.
As of what you write about using cable and or BT for file transfer, it
is all true but if reader does not work anymore, it might be hard to
get files back. Or impossible. I have a smartphone, just to make a
photo sometimes, nothing more, and I store it all on sdcard,
again. This one phone does not want to emulate pendrive when I plug it
into my Linux box, I had to install ftp server on it (phone). If it
dies, I will unplug the card. Easy as a pie. No need to spend days
looking for some bloody program which maybe is Windows-XP-only or
maybe just been dropped from manufacturer support website. So I guess
my next phone should better have a socket for sdcard, too.
This is the future - if I have a need to support, I need to support it
by me. Investing money in something that may turn into a brick and get
my stuff with it - I do not think so. Too bad if some manufacturer
does not want to play along, but they come by a dozen so maybe I can
find what I want :-)...
[...]
> This device does not advertise any water-proofing at all.
Well, somehow all readers without sdcard tout about them being
waterproof. Like I wanted to read in water.
[...]
> The exception to this is my watch. I have a Fossil Hybrid smart-watch.
> It's an analog watch with an e-ink background that can show me
> alerts/info from my phone. It only needs to be charged about once every
> 10 days or so. I don't want a watch that I'd have to recharge daily.
I bought me a cheap mechanical clock from China (say, 17 bucks). It only
needs a "mechanical recharge" once a day and I have to turn it back 5
minutes once a week to keep it somewhat accurate. Good enough for me.
[...]
> The other thing I use this device for is for hand-written notes and
> sketches. Writing on the screen with the stylus feels a lot like
> writing with a pencil on paper. I'm not an artist by any stretch, but
> sometimes writing things out and using boxes, circles, helps sort things
> out in my head, or sometimes I'll sketch things out while working on a
> design I eventually want to 3d print. All stuff I'd historically done in
> a paper notebook, but now I have similarly-sized electronic notepad
> where I can erase mistakes, have layers to my drawings (like
> photoshop/gimp/etc), zoom in/out, etc.
Well I went the other way - bought some pencils to see how it goes
with paper notes. Color crayons, some other thingies for making
circles and lines (compass, ruler). Now, seems I will have to make
some kind of notebook (in a future, not now) because what I see in my
shops somehow does not convince me to buy (yeah I know about
moleskines and leichturms, but this is not what I want, not sure what
I want). Right now, it is just a normal crappy school notebook but
with nice cover :-) .
> The notepad app is also linked to my DropBox account, so when I'm done
> with whatever doodles/notes/sketches, I can then load them up on one
> of my other systems as PDFs. I've even toyed with the idea of writing
"Dropbox baaad, shoe box good" :-)
--
Regards,
Tomasz Rola
--
** A C programmer asked whether computer had Buddha's nature. **
** As the answer, master did "rm -rif" on the programmer's home **
** directory. And then the C programmer became enlightened... **
** **
** Tomasz Rola mailto:tomasz_rola@bigfoot.com **
^ permalink raw reply [flat|nested] 46+ messages in thread
end of thread, other threads:[~2023-07-10 9:08 UTC | newest]
Thread overview: 46+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-03-02 18:54 [COFF] Requesting thoughts on extended regular expressions in grep Grant Taylor via COFF
2023-03-02 19:23 ` [COFF] " Clem Cole
2023-03-02 19:38 ` Grant Taylor via COFF
2023-03-02 23:01 ` Stuff Received
2023-03-02 23:46 ` Steffen Nurpmeso
2023-03-03 1:08 ` Grant Taylor via COFF
2023-03-03 2:10 ` Dave Horsfall
2023-03-03 3:34 ` Grant Taylor via COFF
2023-03-02 21:53 ` Dan Cross
2023-03-03 1:05 ` Grant Taylor via COFF
2023-03-03 3:04 ` Dan Cross
2023-03-03 3:53 ` Grant Taylor via COFF
2023-03-03 13:47 ` Dan Cross
2023-03-03 19:26 ` Grant Taylor via COFF
2023-03-03 10:59 ` Ralph Corderoy
2023-03-03 13:11 ` Dan Cross
2023-03-03 13:42 ` Ralph Corderoy
2023-03-03 19:19 ` Grant Taylor via COFF
2023-03-04 10:15 ` [COFF] Reading PDFs on a mobile. (Was: Requesting thoughts on extended regular expressions in grep.) Ralph Corderoy
2023-03-07 21:49 ` [COFF] " Tomasz Rola
2023-03-07 22:46 ` Tomasz Rola
2023-06-20 16:02 ` Michael Parson
2023-06-20 21:26 ` Tomasz Rola
2023-06-22 15:45 ` Michael Parson
2023-07-10 9:08 ` [COFF] Re: Reader, paper, tablet, phone (was: Re: Reading PDFs on a mobile. (Was: Requesting thoughts on extended regular expressions in grep.)) Tomasz Rola
2023-03-03 16:12 ` [COFF] Re: Requesting thoughts on extended regular expressions in grep Dave Horsfall
2023-03-03 17:13 ` Dan Cross
2023-03-03 17:38 ` Ralph Corderoy
2023-03-03 19:09 ` Dan Cross
2023-03-03 19:36 ` Grant Taylor via COFF
2023-03-04 10:26 ` Ralph Corderoy
2023-03-03 19:06 ` Grant Taylor via COFF
2023-03-03 19:31 ` Dan Cross
2023-03-04 10:07 ` Ralph Corderoy
2023-03-06 10:01 ` Ed Bradford
2023-03-06 21:01 ` Dan Cross
2023-03-06 21:49 ` Steffen Nurpmeso
2023-03-07 1:43 ` Larry McVoy
2023-03-07 4:01 ` Ed Bradford
2023-03-07 11:39 ` [COFF] " Ralph Corderoy
2023-03-07 18:31 ` [COFF] " Grant Taylor via COFF
2023-03-08 11:22 ` Ed Bradford
2023-03-07 16:14 ` Dan Cross
2023-03-07 17:34 ` [COFF] " Ralph Corderoy
2023-03-07 18:33 ` [COFF] " Dan Cross
2023-03-07 4:19 ` Ed Bradford
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).