From mboxrd@z Thu Jan 1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on inbox.vuxu.org
X-Spam-Level:
X-Spam-Status: No, score=-0.8 required=5.0 tests=DKIM_ADSP_CUSTOM_MED,
DKIM_INVALID,DKIM_SIGNED,FREEMAIL_FROM,MAILING_LIST_MULTI
autolearn=ham autolearn_force=no version=3.4.4
Received: (qmail 14633 invoked from network); 2 Mar 2023 21:54:20 -0000
Received: from minnie.tuhs.org (2600:3c01:e000:146::1)
by inbox.vuxu.org with ESMTPUTF8; 2 Mar 2023 21:54:20 -0000
Received: from minnie.tuhs.org (localhost [IPv6:::1])
by minnie.tuhs.org (Postfix) with ESMTP id D75DE4326A;
Fri, 3 Mar 2023 07:54:16 +1000 (AEST)
Received: from mail-lj1-x235.google.com (mail-lj1-x235.google.com [IPv6:2a00:1450:4864:20::235])
by minnie.tuhs.org (Postfix) with ESMTPS id 5B13843269
for ; Fri, 3 Mar 2023 07:54:10 +1000 (AEST)
Received: by mail-lj1-x235.google.com with SMTP id a32so427192ljr.9
for ; Thu, 02 Mar 2023 13:54:10 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
d=gmail.com; s=20210112;
h=content-transfer-encoding:cc:to:subject:message-id:date:from
:in-reply-to:references:mime-version:from:to:cc:subject:date
:message-id:reply-to;
bh=eDLrdA8ERbmDek+ncM140pjwn14V4H0DjX1pjR4tUKg=;
b=QqGa6kwCAD3qoAZ2vUzjNWjriHec6tcWw1bnJRIQGm8GnxUX9u2NSAL6dPGtoSvxpA
kTdbvgsjUOyF+N+/iMYr1i8koiDYSDyFYT8LgaHhw+HS3dNstQteKxnvac8ShyaxdlT2
TNC7zWnd3KU4WBhMbjAh9TaVhz6UJOoCLRHBOjjvgv/qscgV+XC1WZLNikfP4IuGNouh
SJ0GjPBxwkEmDNX8kCD7BuYlbIVTAeoa1HdgGZBictF1HbN7sgi3zmbka9DxG9NJlZOc
4jDGGS+u0svL9WMe801e7vL/vtqPQD6tzBBqpg8xXqJJCgMa6JfCE659/4VPt0pDMaXf
RwzA==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
d=1e100.net; s=20210112;
h=content-transfer-encoding:cc:to:subject:message-id:date:from
:in-reply-to:references:mime-version:x-gm-message-state:from:to:cc
:subject:date:message-id:reply-to;
bh=eDLrdA8ERbmDek+ncM140pjwn14V4H0DjX1pjR4tUKg=;
b=hYcYFqI4AAld1D0/Mni3mND8y1/1K42fwuo4pJOCol215214H8jMlSth3TfCjgiSX2
HyICpjjmHv3tG9QMSJVc3O7K9ea40mi4yBvETmMgcevmsvrLJfV12Ty6MroxAS5veOw6
G5ltKlj708Ok6y38qIKwrQFCtY0Dbo0/U/tZJytYyIxdk5qFvOV/ci3smxj7lLDH//J2
cMyIL2WKLC2598/9X7UVVxJlU0AZrApAwaLGNwp9V4BAh5dJ0AWHH+wa8EzmYHaBZy4k
+LKk723THxgV8M8tuJ8DC8qYaPXXdBiNJuNQI38vFxwArVGmVqgdhaK9rV66AFJw6Won
IGJA==
X-Gm-Message-State: AO0yUKXioQwlf3B6cyLGsmaPrMImCZwSoGbPqksB509nceSijeFgGZua
WFYZh6RsVmM/XgqKJrqr4zIWtTFVUUdGDRc/sruC7AzxeNQ=
X-Google-Smtp-Source: AK7set/K5l3VnTajjD1aHoPL/d1XQvyDXsBWUovtAfOYo/RPNwSMSUkcpxFq2T4NlwM37gB5Eu489286uDXTeqZ2GWA=
X-Received: by 2002:a2e:58c:0:b0:295:944c:f335 with SMTP id
134-20020a2e058c000000b00295944cf335mr3771676ljf.1.1677794047847; Thu, 02 Mar
2023 13:54:07 -0800 (PST)
MIME-Version: 1.0
References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net>
In-Reply-To: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net>
From: Dan Cross
Date: Thu, 2 Mar 2023 16:53:31 -0500
Message-ID:
To: Grant Taylor
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Message-ID-Hash: BVTV7VMKYTAVKLILHHM5LU3WMAG4S7SA
X-Message-ID-Hash: BVTV7VMKYTAVKLILHHM5LU3WMAG4S7SA
X-MailFrom: crossd@gmail.com
X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header
CC: COFF
X-Mailman-Version: 3.3.6b1
Precedence: list
Subject: [COFF] Re: Requesting thoughts on extended regular expressions in grep.
List-Id: Computer Old Farts Forum
Archived-At:
List-Archive:
List-Help:
List-Owner:
List-Post:
List-Subscribe:
List-Unsubscribe:
On Thu, Mar 2, 2023 at 1:55=E2=80=AFPM Grant Taylor via COFF wrote:
> I'd like some thoughts ~> input on extended regular expressions used
> with grep, specifically GNU grep -e / egrep.
>
> What are the pros / cons to creating extended regular expressions like
> the following:
>
> ^\w{3}
>
> vs:
>
> ^(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)
Well, obviously the former matches any sequence 3 of
alpha-numerics/underscores at the beginning of a string, while the
latter only matches abbreviations of months in the western calendar;
that is, the two REs are matching very different things (the latter is
a strict subset of the former). But I suspect you mean in a more
general sense.
> Or:
>
> [ :[:digit:]]{11}
...do you really want to match a space, a colon and a single digit 11
times in a single string?
> vs:
>
> ( 1| 2| 3| 4| 5| 6| 7| 8|
> 9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31)
> (0|1|2)[[:digit:]]:(0|1|2|3|4|5)[[:digit:]]:(0|1|2|3|4|5)[[:digit:]]
Using character classes would greatly simplify what you're trying to
do. It seems like this could be simplified to (untested) snippet:
( [1-9]|[12][[0-9]]|3[01]) [0-2][0-9]:[0-5][0-9]:[0-5][0-9]
For this, I'd probably eschew `[:digit:]`. Named character classes are
for handy locale support, or in lieu of typing every character in the
alphabet (though we can use ranges to abbreviate that), but it kind of
seems like that's not coming into play here and, IMHO, `[0-9]` is
clearer in context.
> I'm currently eliding the 61st (60) second, the 32nd day, and dealing
> with February having fewer days for simplicity.
It's not clear to me that dates, in their generality, can be matched
with regular expressions. Consider leap years; you'd almost
necessarily have to use backtracking for that, but I admit I haven't
thought it through.
> For matching patterns like the following in log files?
>
> Mar 2 03:23:38
>
> I'm working on organically training logcheck to match known good log
> entries. So I'm *DEEP* in the bowels of extended regular expressions
> (GNU egrep) that runs over all logs hourly. As such, I'm interested in
> making sure that my REs are both efficient and accurate or at least not
> WILDLY badly structured. The pedantic part of me wants to avoid
> wildcard type matches (\w), even if they are bounded (\w{3}), unless it
> truly is for unpredictable text.
`\w` is a GNU extension; I'd probably avoid it on portability grounds
(though `\b` is very handy).
The thing about regular expressions is that they describe regular
languages, and regular languages are those for which there exists a
finite automaton that can recognize the language. An important class
of finite automata are deterministic finite automata; by definition,
recognition by such automata are linear in the length of the input.
However, construction of a DFA for any given regular expression can be
superlinear (in fact, it can be exponential) so practically speaking,
we usually construct non-deterministic finite automata (NDFAs) and
"simulate" their execution for matching. NDFAs generalize DFAs (DFAs
are a subset of NDFAs, incidentally) in that, in any non-terminal
state, there can be multiple subsequent states that the machine can
transition to given an input symbol. When executed, for any state, the
simulator will transition to every permissible subsequent state
simultaneously, discarding impossible states as they become evident.
This implies that NDFA execution is superlinear, but it is bounded,
and is O(n*m*e), where n is the length of the input, m is the number
of nodes in the state transition graph corresponding to the NDFA, and
e is the maximum number of edges leaving any node in that graph (for a
fully connected graph, that would m, so this can be up to O(n*m^2)).
Construction of an NDFA is O(m), so while it's slower to execute, it's
actually possible to construct in a reasonable amount of time. Russ's
excellent series of articles that Clem linked to gives details and
algorithms.
> I'd appreciate any feedback and recommendations from people who have
> been using and / or optimizing (extended) regular expressions for longer
> than I have been using them.
>
> Thank you for your time and input.
In practical terms? Basically, don't worry about it too much. Egrep
will generate an NDFA simulation that's going to be acceptably fast
for all but the weirdest cases.
- Dan C.