From mboxrd@z Thu Jan 1 00:00:00 1970 X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on inbox.vuxu.org X-Spam-Level: X-Spam-Status: No, score=-0.8 required=5.0 tests=DKIM_ADSP_CUSTOM_MED, DKIM_INVALID,DKIM_SIGNED,FREEMAIL_FROM,MAILING_LIST_MULTI autolearn=ham autolearn_force=no version=3.4.4 Received: (qmail 14633 invoked from network); 2 Mar 2023 21:54:20 -0000 Received: from minnie.tuhs.org (2600:3c01:e000:146::1) by inbox.vuxu.org with ESMTPUTF8; 2 Mar 2023 21:54:20 -0000 Received: from minnie.tuhs.org (localhost [IPv6:::1]) by minnie.tuhs.org (Postfix) with ESMTP id D75DE4326A; Fri, 3 Mar 2023 07:54:16 +1000 (AEST) Received: from mail-lj1-x235.google.com (mail-lj1-x235.google.com [IPv6:2a00:1450:4864:20::235]) by minnie.tuhs.org (Postfix) with ESMTPS id 5B13843269 for ; Fri, 3 Mar 2023 07:54:10 +1000 (AEST) Received: by mail-lj1-x235.google.com with SMTP id a32so427192ljr.9 for ; Thu, 02 Mar 2023 13:54:10 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=eDLrdA8ERbmDek+ncM140pjwn14V4H0DjX1pjR4tUKg=; b=QqGa6kwCAD3qoAZ2vUzjNWjriHec6tcWw1bnJRIQGm8GnxUX9u2NSAL6dPGtoSvxpA kTdbvgsjUOyF+N+/iMYr1i8koiDYSDyFYT8LgaHhw+HS3dNstQteKxnvac8ShyaxdlT2 TNC7zWnd3KU4WBhMbjAh9TaVhz6UJOoCLRHBOjjvgv/qscgV+XC1WZLNikfP4IuGNouh SJ0GjPBxwkEmDNX8kCD7BuYlbIVTAeoa1HdgGZBictF1HbN7sgi3zmbka9DxG9NJlZOc 4jDGGS+u0svL9WMe801e7vL/vtqPQD6tzBBqpg8xXqJJCgMa6JfCE659/4VPt0pDMaXf RwzA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=eDLrdA8ERbmDek+ncM140pjwn14V4H0DjX1pjR4tUKg=; b=hYcYFqI4AAld1D0/Mni3mND8y1/1K42fwuo4pJOCol215214H8jMlSth3TfCjgiSX2 HyICpjjmHv3tG9QMSJVc3O7K9ea40mi4yBvETmMgcevmsvrLJfV12Ty6MroxAS5veOw6 G5ltKlj708Ok6y38qIKwrQFCtY0Dbo0/U/tZJytYyIxdk5qFvOV/ci3smxj7lLDH//J2 cMyIL2WKLC2598/9X7UVVxJlU0AZrApAwaLGNwp9V4BAh5dJ0AWHH+wa8EzmYHaBZy4k +LKk723THxgV8M8tuJ8DC8qYaPXXdBiNJuNQI38vFxwArVGmVqgdhaK9rV66AFJw6Won IGJA== X-Gm-Message-State: AO0yUKXioQwlf3B6cyLGsmaPrMImCZwSoGbPqksB509nceSijeFgGZua WFYZh6RsVmM/XgqKJrqr4zIWtTFVUUdGDRc/sruC7AzxeNQ= X-Google-Smtp-Source: AK7set/K5l3VnTajjD1aHoPL/d1XQvyDXsBWUovtAfOYo/RPNwSMSUkcpxFq2T4NlwM37gB5Eu489286uDXTeqZ2GWA= X-Received: by 2002:a2e:58c:0:b0:295:944c:f335 with SMTP id 134-20020a2e058c000000b00295944cf335mr3771676ljf.1.1677794047847; Thu, 02 Mar 2023 13:54:07 -0800 (PST) MIME-Version: 1.0 References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net> In-Reply-To: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net> From: Dan Cross Date: Thu, 2 Mar 2023 16:53:31 -0500 Message-ID: To: Grant Taylor Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Message-ID-Hash: BVTV7VMKYTAVKLILHHM5LU3WMAG4S7SA X-Message-ID-Hash: BVTV7VMKYTAVKLILHHM5LU3WMAG4S7SA X-MailFrom: crossd@gmail.com X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: COFF X-Mailman-Version: 3.3.6b1 Precedence: list Subject: [COFF] Re: Requesting thoughts on extended regular expressions in grep. List-Id: Computer Old Farts Forum Archived-At: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: On Thu, Mar 2, 2023 at 1:55=E2=80=AFPM Grant Taylor via COFF wrote: > I'd like some thoughts ~> input on extended regular expressions used > with grep, specifically GNU grep -e / egrep. > > What are the pros / cons to creating extended regular expressions like > the following: > > ^\w{3} > > vs: > > ^(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) Well, obviously the former matches any sequence 3 of alpha-numerics/underscores at the beginning of a string, while the latter only matches abbreviations of months in the western calendar; that is, the two REs are matching very different things (the latter is a strict subset of the former). But I suspect you mean in a more general sense. > Or: > > [ :[:digit:]]{11} ...do you really want to match a space, a colon and a single digit 11 times in a single string? > vs: > > ( 1| 2| 3| 4| 5| 6| 7| 8| > 9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31) > (0|1|2)[[:digit:]]:(0|1|2|3|4|5)[[:digit:]]:(0|1|2|3|4|5)[[:digit:]] Using character classes would greatly simplify what you're trying to do. It seems like this could be simplified to (untested) snippet: ( [1-9]|[12][[0-9]]|3[01]) [0-2][0-9]:[0-5][0-9]:[0-5][0-9] For this, I'd probably eschew `[:digit:]`. Named character classes are for handy locale support, or in lieu of typing every character in the alphabet (though we can use ranges to abbreviate that), but it kind of seems like that's not coming into play here and, IMHO, `[0-9]` is clearer in context. > I'm currently eliding the 61st (60) second, the 32nd day, and dealing > with February having fewer days for simplicity. It's not clear to me that dates, in their generality, can be matched with regular expressions. Consider leap years; you'd almost necessarily have to use backtracking for that, but I admit I haven't thought it through. > For matching patterns like the following in log files? > > Mar 2 03:23:38 > > I'm working on organically training logcheck to match known good log > entries. So I'm *DEEP* in the bowels of extended regular expressions > (GNU egrep) that runs over all logs hourly. As such, I'm interested in > making sure that my REs are both efficient and accurate or at least not > WILDLY badly structured. The pedantic part of me wants to avoid > wildcard type matches (\w), even if they are bounded (\w{3}), unless it > truly is for unpredictable text. `\w` is a GNU extension; I'd probably avoid it on portability grounds (though `\b` is very handy). The thing about regular expressions is that they describe regular languages, and regular languages are those for which there exists a finite automaton that can recognize the language. An important class of finite automata are deterministic finite automata; by definition, recognition by such automata are linear in the length of the input. However, construction of a DFA for any given regular expression can be superlinear (in fact, it can be exponential) so practically speaking, we usually construct non-deterministic finite automata (NDFAs) and "simulate" their execution for matching. NDFAs generalize DFAs (DFAs are a subset of NDFAs, incidentally) in that, in any non-terminal state, there can be multiple subsequent states that the machine can transition to given an input symbol. When executed, for any state, the simulator will transition to every permissible subsequent state simultaneously, discarding impossible states as they become evident. This implies that NDFA execution is superlinear, but it is bounded, and is O(n*m*e), where n is the length of the input, m is the number of nodes in the state transition graph corresponding to the NDFA, and e is the maximum number of edges leaving any node in that graph (for a fully connected graph, that would m, so this can be up to O(n*m^2)). Construction of an NDFA is O(m), so while it's slower to execute, it's actually possible to construct in a reasonable amount of time. Russ's excellent series of articles that Clem linked to gives details and algorithms. > I'd appreciate any feedback and recommendations from people who have > been using and / or optimizing (extended) regular expressions for longer > than I have been using them. > > Thank you for your time and input. In practical terms? Basically, don't worry about it too much. Egrep will generate an NDFA simulation that's going to be acceptably fast for all but the weirdest cases. - Dan C.