From: Grant Taylor via COFF
Reply-To: Grant Taylor
To: coff@tuhs.org
Subject: [COFF] Re: Requesting thoughts on extended regular expressions in grep.
Date: Thu, 2 Mar 2023 18:05:51 -0700
Organization: TNet Consulting
Message-ID: <688396c8-7a25-5cd6-282c-49f1b13117d4@spamtrap.tnetconsulting.net>
References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net>
List-Id: Computer Old Farts Forum

On 3/2/23 2:53 PM, Dan Cross wrote:
> Well, obviously the former matches any sequence of 3
> alpha-numerics/underscores at the beginning of a string, while the
> latter only matches abbreviations of months in the western calendar;
> that is, the two REs are matching very different things (the latter
> is a strict subset of the former).

I completely agree with you. That's also why I want to start using
the latter, more specific RE. But I don't know where the line of
over-complicating things is, so I can't yet tell how to avoid
crossing it.

> But I suspect you mean in a more general sense.

Yes and no. Does the comment above clarify at all?

> ...do you really want to match a space, a colon and a single digit
> 11 times ...

Yes.

> ... in a single string?

What constitutes a single string? ;-) I sort of rhetorically ask.

The log lines start with

    MMM dd hh:mm:ss

Where:
  - MMM is the month abbreviation
  - dd is the day of the month
  - hh is the hour of the day
  - mm is the minute of the hour
  - ss is the second of the minute

So, yes, there are eleven characters that fall into the class
consisting of a space or a colon or a digit.

Is that a single string? It depends on what you're looking at. The
sequences of non-whitespace in the log? No. The pattern that I'm
matching? Yes.

> Using character classes would greatly simplify what you're trying to
> do. It seems like this could be simplified to (untested) snippet:

Agreed. I'm starting with the examples that came with the logcheck
package that I'm working with, such as "^\w{3} [ :[:digit:]]{11}",
and evaluating what I want to do.

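To make the difference between the loose and the more specific pattern
concrete, here is a minimal Python sketch. It rests on a few
assumptions that are not from the logcheck package: the day of the
month is space-padded as in traditional syslog timestamps, `[0-9]`
stands in for `[[:digit:]]` because Python's `re` module has no POSIX
bracket classes, and the hour is tightened to 00-23. The sample log
lines and the `classify` helper are hypothetical.

```python
import re

# Loose pattern modelled on the logcheck example "^\w{3} [ :[:digit:]]{11}".
# [0-9] replaces [[:digit:]] since Python's re lacks POSIX bracket classes.
LOOSE = re.compile(r"^\w{3} [ :0-9]{11}")

# Stricter pattern: real month abbreviations, a space-padded day of month
# (assumed syslog layout), and a range-checked hh:mm:ss.
MONTH = r"(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"
DAY = r"( [1-9]|[12][0-9]|3[01])"
TIME = r"([01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9]"
STRICT = re.compile(rf"^{MONTH} {DAY} {TIME}")


def classify(line):
    """Return (loose_matches, strict_matches) for the start of a line."""
    return bool(LOOSE.match(line)), bool(STRICT.match(line))


if __name__ == "__main__":
    samples = [
        "Mar  2 18:05:51 host sshd[123]: example",    # plausible timestamp
        "Foo 99 99:99:99 host sshd[123]: example",    # nonsense timestamp
        "abc " + ":" * 11 + " not a timestamp",       # eleven colons
    ]
    for s in samples:
        print(classify(s), s)
```

All three samples satisfy the loose pattern, but only the first one
satisfies the strict one. The strict pattern uses only alternation,
groups, and bracket ranges, so the same text should also work as an
extended regular expression for grep -E.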
I actually like the idea of dividing out the following:

  - months that have 31 days: Jan, Mar, May, Jul, Aug, Oct, and Dec
  - months that have 30 days: Apr, Jun, Sep, and Nov
  - the month that has 28/29 days: Feb

> ( [1-9]|[12][[0-9]]|3[01]) [0-2][0-9]:[0-5][0-9]:[0-5][0-9]

Aside: Why do you have the double square brackets in "[12][[0-9]]"?

> For this, I'd probably eschew `[:digit:]`. Named character classes
> are for handy locale support, or in lieu of typing every character
> in the alphabet (though we can use ranges to abbreviate that), but
> it kind of seems like that's not coming into play here and, IMHO,
> `[0-9]` is clearer in context.

ACK

"[[:digit:]]+" was a construct that I'm parroting. It and
"[.:[:xdigit:]]+" are good for some things. But they definitely aren't
the best for all things.

Hence trying to find the line of being more accurate without going too
far.

> It's not clear to me that dates, in their generality, can be
> matched with regular expressions. Consider leap years; you'd almost
> necessarily have to use backtracking for that, but I admit I haven't
> thought it through.

Given the context that these extended regular expressions are going to
be used in, logcheck -- filtering out known okay log entries to email
what doesn't get filtered -- I'm okay with having a few things slip
through like leap day / leap seconds / leap frogs.

> `\w` is a GNU extension; I'd probably avoid it on portability grounds
> (though `\b` is very handy).

I hear, understand, and acknowledge your concern. At present, these
filters are being used in a package, logcheck, which I believe is
specific to Debian and its ilk. As such, GNU grep is very much a thing.

I'm also not a fan of the use of `\w` and would prefer to (...|...)
things.

> The thing about regular expressions is that they describe regular
> languages, and regular languages are those for which there exists a
> finite automaton that can recognize the language.
> An important class of finite automata are deterministic finite
> automata; by definition, recognition by such automata is linear in
> the length of the input.
>
> However, construction of a DFA for any given regular expression can
> be superlinear (in fact, it can be exponential) so practically
> speaking, we usually construct non-deterministic finite automata
> (NDFAs) and "simulate" their execution for matching. NDFAs
> generalize DFAs (DFAs are a subset of NDFAs, incidentally) in that,
> in any non-terminal state, there can be multiple subsequent states
> that the machine can transition to given an input symbol. When
> executed, for any state, the simulator will transition to every
> permissible subsequent state simultaneously, discarding impossible
> states as they become evident.
>
> This implies that NDFA execution is superlinear, but it is bounded,
> and is O(n*m*e), where n is the length of the input, m is the number
> of nodes in the state transition graph corresponding to the NDFA, and
> e is the maximum number of edges leaving any node in that graph (for
> a fully connected graph, that would be m, so this can be up to
> O(n*m^2)). Construction of an NDFA is O(m), so while it's slower to
> execute, it's actually possible to construct in a reasonable amount
> of time. Russ's excellent series of articles that Clem linked to
> gives details and algorithms.

I only vaguely understand those three paragraphs as they are deeper
computer science than I've gone before.

I think I get the gist of them but could not explain them if my life
depended upon it.

> In practical terms? Basically, don't worry about it too much. Egrep
> will generate an NDFA simulation that's going to be acceptably fast
> for all but the weirdest cases.

ACK

It sounds like I can make any reasonable extended regular expression a
human can read and I'll probably be good.

Thank you for the detailed response, Dan. :-)

-- 
Grant. . . .
unix || die