From mboxrd@z Thu Jan 1 00:00:00 1970 X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on inbox.vuxu.org X-Spam-Level: X-Spam-Status: No, score=-0.8 required=5.0 tests=DKIM_ADSP_CUSTOM_MED, DKIM_INVALID,DKIM_SIGNED,FREEMAIL_FROM,MAILING_LIST_MULTI autolearn=ham autolearn_force=no version=3.4.4 Received: (qmail 5356 invoked from network); 3 Mar 2023 13:48:26 -0000 Received: from minnie.tuhs.org (2600:3c01:e000:146::1) by inbox.vuxu.org with ESMTPUTF8; 3 Mar 2023 13:48:26 -0000 Received: from minnie.tuhs.org (localhost [IPv6:::1]) by minnie.tuhs.org (Postfix) with ESMTP id 8FAB643336; Fri, 3 Mar 2023 23:48:24 +1000 (AEST) Received: from mail-lf1-x130.google.com (mail-lf1-x130.google.com [IPv6:2a00:1450:4864:20::130]) by minnie.tuhs.org (Postfix) with ESMTPS id A90E343334 for ; Fri, 3 Mar 2023 23:48:17 +1000 (AEST) Received: by mail-lf1-x130.google.com with SMTP id f18so3689420lfa.3 for ; Fri, 03 Mar 2023 05:48:17 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=o7ZFHd+bI26HX8ryKtVcdcbgJ9X6dOvCFGdJ20IUIT8=; b=g97jdfg4zJsi9u2IekJCvloi+xNbMQazZvQlgyQRWKSAaqMIZOSdCBA9E3wFqEo4b2 p4hR04zF9k/kuOYgtQb9xDLmy2ab1+NAzY6bnRBZSIAz3zNKM+yXDUyn5EjNDbO1mrJh x3JDmFjtN/klg/avv2TVUNqCwX62AWB7eJQd6j0Zj1pVz14KQfQ6GJe8KdF/3M6+vDoI oOKTox6cfVNTfptltdXWI9LAgjzDX5IhIr9Ep+2VkJQ2sf9qAmrzDu5pFzbkBE7ekpQm tY8tiv+IwWfsd+OUG4tZGeJJoU/NE8OeB7VGoWlPYWwbiSL7Xtj68N9ud/KEyicsmMXP BfNA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=o7ZFHd+bI26HX8ryKtVcdcbgJ9X6dOvCFGdJ20IUIT8=; b=7quGkmrK7fFAPQlPCqMdJjHOKNhAw5+8yMdph+hKSai5+BsT3+KY7/4WpKFAiJuHIu j4EGWRcyO7fJ64IXr50L7q1hcfOROUEfOl4UyN2d5iHilx8cjHE77ykHE10P8CEexg5t WQABcSifcCnnJ4moGnzs1p08/5JVWpiHRcM7pCbd8+YGCio5Ji2xI5lPQFmSl1jBxv+l VjJOxYGHGvklvAbwGVk6zqYTAr2KuljDxBXqykzE5jn38I86H1MLtszJs8BXusTuv9pu gMXA+skSXy+62FcMQI9csBVcAaYOherc1VwcYrQvO/rzHtlKI/NU7c/sOPKo2NCQKtUc Yx8A== X-Gm-Message-State: AO0yUKVUNvI1+5KG1Z5P/uBGPAQBsrDMiUbTTc+znsMTHiLR/swhSg1n 2kji5Z8VDA4ENNjZPXhC5Mg9JjViYPx223M/DQzyBl+l X-Google-Smtp-Source: AK7set+fAqOyiQI1sScj60hQHDFgXbch3NN63TkpMZOYu8R5mJrQ/AvEqMOAS7bLgfbS25MaCqoW4L05Q+WbHj0iG24= X-Received: by 2002:a05:6512:20d:b0:4d5:ca43:7047 with SMTP id a13-20020a056512020d00b004d5ca437047mr636088lfo.10.1677851295776; Fri, 03 Mar 2023 05:48:15 -0800 (PST) MIME-Version: 1.0 References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net> <688396c8-7a25-5cd6-282c-49f1b13117d4@spamtrap.tnetconsulting.net> <1519cce3-1c38-8a9c-cfdd-b39484bd163b@spamtrap.tnetconsulting.net> In-Reply-To: <1519cce3-1c38-8a9c-cfdd-b39484bd163b@spamtrap.tnetconsulting.net> From: Dan Cross Date: Fri, 3 Mar 2023 08:47:39 -0500 Message-ID: To: Grant Taylor Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Message-ID-Hash: 64C3RZRD4BVKKLSOKEEKCPLTF5CO3VVL X-Message-ID-Hash: 64C3RZRD4BVKKLSOKEEKCPLTF5CO3VVL X-MailFrom: crossd@gmail.com X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: coff@tuhs.org X-Mailman-Version: 3.3.6b1 Precedence: list Subject: [COFF] Re: Requesting thoughts on extended regular expressions in grep. List-Id: Computer Old Farts Forum Archived-At: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: On Thu, Mar 2, 2023 at 10:53=E2=80=AFPM Grant Taylor via COFF wrote: >[snip > Here's an example of the full RE: > > ^\w{3} [ :[:digit:]]{11} [._[:alnum:]-]+ > postfix/msa/smtpd\[[[:digit:]]+\]: timeout after STARTTLS from > [._[:alnum:]-]+\[[.:[:xdigit:]]+\]$ > > As you can see the "[ :[:digit:]]{11}" is actually only a sub-part of a > larger RE and there is bounding & delimiting around the subpart. Oh, for sure; to be clear, it was obvious that in the earlier discussion the original was just part of something larger. FWIW, this RE seems ok to me; the additional context makes it unlikely to match something else accidentally. > This is to match a standard message from postfix via standard SYSLOG. > > > Ah. I suspect this relies on domain knowledge about the format of log > > lines to match reliably. Otherwise it could match, `___ 123 456:789` > > which is probably not what you are expecting. > > Yep. > > Though said domain knowledge isn't anything special in and of itself. It needn't be special. The point is simply that there's some external knowledge that can be brought to bear to guide the shape of the REs. In this case, you know that log lines won't begin with `___ 123 456:789` or other similar junk. > [snip] > > Basically, a regular expression is a regular expression if you can buil= d > > a machine with no additional memory that can tell you whether or not a > > given string matches the RE examining its input one character at a time= . > > I /think/ that I could build a complex nested tree of switch statements > to test each character to see if things match what they should or not. > Though I would need at least one variable / memory to hold absolutely > minimal state to know where I am in the switch tree. I think a number > to identify the switch statement in question would be sufficient. So > I'm guessing two bytes of variable and uncounted bytes of program code. Kinda. The "machine" in this case is actually an abstraction, like a Turing machine. The salient point here is that REs map to finite state machines, and in particular, one need not keep (say) a stack of prior states when simulating them. Note that even in an NDFA simulation, where one keeps track of what states one may be in, one doesn't need to keep track of how one got into those states. Obviously in a real implementation you've got the program counter, register contents, local variables, etc, all of which consume "memory" in the conventional sense. But the point is that you don't need additional memory proportional to anything other than the size of the RE. DFA implementation could be implemented entirely with `switch` and `goto` if one wanted, as opposed to a bunch of mutually recursive function calls, NDFA simulation similarly except that you need some (bounded) additional memory to hold the active set of states. Contrast this with a pushdown automata, which can parse a context-free language, in which a stack is maintained that can store additional information relative to the input (for example, an already seen character). Pushdown automata can, for example, recognize matched parenthesis while regular languages cannot. Anyway, sorry, this is all rather more theoretical than is perhaps interesting or useful. Bottom line is, I think your REs are probably fine. `egrep` will complain at you if they are not, and I wouldn't worry too much about optimizing them: I'd "stop" whenever you're happy that you've got something understandable that matches what you want it to match. - Dan C.