From mboxrd@z Thu Jan 1 00:00:00 1970 X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on inbox.vuxu.org X-Spam-Level: X-Spam-Status: No, score=-0.8 required=5.0 tests=DKIM_ADSP_CUSTOM_MED, DKIM_INVALID,DKIM_SIGNED,FREEMAIL_FROM,MAILING_LIST_MULTI autolearn=ham autolearn_force=no version=3.4.4 Received: (qmail 9883 invoked from network); 7 Mar 2023 18:33:50 -0000 Received: from minnie.tuhs.org (50.116.15.146) by inbox.vuxu.org with ESMTPUTF8; 7 Mar 2023 18:33:50 -0000 Received: from minnie.tuhs.org (localhost [IPv6:::1]) by minnie.tuhs.org (Postfix) with ESMTP id E4010411F2; Wed, 8 Mar 2023 04:33:48 +1000 (AEST) Received: from mail-lf1-x130.google.com (mail-lf1-x130.google.com [IPv6:2a00:1450:4864:20::130]) by minnie.tuhs.org (Postfix) with ESMTPS id 9E101411F0 for ; Wed, 8 Mar 2023 04:33:39 +1000 (AEST) Received: by mail-lf1-x130.google.com with SMTP id j11so18258954lfg.13 for ; Tue, 07 Mar 2023 10:33:39 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; t=1678214017; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=3j4LE56TV/rzxw8SQea+k8ePD0vC54ukh3CTd0dectM=; b=ekVvFF9dYK3pmaIfAVPD39PK/YFnj/X1SYtG9hacXHXlp85KVNOTXHhf//U9HQgIIe qsO8A1UjiRc8LXxj5eDGeWkYEa0/kr4cad7Ld8c+Xh913uE2d4U9vIsFOAUT12GsaFZe orjMHXB+pl2mfPaA6dsTRRdxWfitshi22Vzt1ElpDeAL7199+wwbPvna9tNMDmcazQSH cvBygtVTWL2DNhzCqxLxxVdG3Qygh01ra8HOqvonhU8hKnQ/nWYc5ICPnA4e8opcCghh Y2Akqi515pPlEifrFTAOPpJkYxu0sQMgRxkjgqsjnJP7lp0lq0rv6oBjTfJM8yo6wd/d 8QiQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; t=1678214017; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=3j4LE56TV/rzxw8SQea+k8ePD0vC54ukh3CTd0dectM=; b=awxMt3JmPM6UckP/9yHibvCkIakNqvZdtBsKhePfCZCKzn3/kmREDfJaYVdPdsbyGA NbidK1K/JF4IxGEY7pvLYgVKeHlePm3NiiTiyC8pL7Mssb1jQ8Ps96hsr0t03422+RUZ gAG7tBhX1xuDjBYAUJ5PTvostAHGlegZHbh9Vt61WhnE+zxvJwR1JAqTFj5YkYcoIhaI Rhn6f8xdMe/R83fno5byKUU4XOFQo/I0Q1dnWH2Yu0oeULkW7hT9QXwFM6Lm2TbWNtZI ktPD6S4B9Qw5uPUATFMsRghdTTeZUQl0Ob0DJ/2QfgTbIZ7YoIOkcrlVHxQXBanQQ7es J4Iw== X-Gm-Message-State: AO0yUKVEOMrKbxkFgyJA96bZfAG2yo1dhZ2nNy557isG8npLlz+Y/VBN 6neYuWuqgqoNoTVW+iiwsFfydsFljNz0yg7dcL4= X-Google-Smtp-Source: AK7set8rteGnypb/XJ/2WQ3M+aO1NSziw+MyI6bP84dukizPDltZEsm8/G9vkNhTRBFZQZDU/CyHT+SoLjbBVZ/2vqE= X-Received: by 2002:ac2:52a1:0:b0:4db:1989:8d92 with SMTP id r1-20020ac252a1000000b004db19898d92mr4812323lfm.10.1678214017081; Tue, 07 Mar 2023 10:33:37 -0800 (PST) MIME-Version: 1.0 References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net> <20230307014311.GN5398@mcvoy.com> <20230307173451.D94B421C9B@orac.inputplus.co.uk> In-Reply-To: <20230307173451.D94B421C9B@orac.inputplus.co.uk> From: Dan Cross Date: Tue, 7 Mar 2023 13:33:00 -0500 Message-ID: To: Ralph Corderoy Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Message-ID-Hash: T4XZYVSIPFZUYERBM4AZ7TRPXS3M2WPX X-Message-ID-Hash: T4XZYVSIPFZUYERBM4AZ7TRPXS3M2WPX X-MailFrom: crossd@gmail.com X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: coff@tuhs.org X-Mailman-Version: 3.3.6b1 Precedence: list Subject: [COFF] Re: Requesting thoughts on extended regular expressions in grep. List-Id: Computer Old Farts Forum Archived-At: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: On Tue, Mar 7, 2023 at 12:34=E2=80=AFPM Ralph Corderoy wrote: > > I'm sure that with enough effort, it _is_ possible to make a 50K RE > > that is understandable to mere mortals. But it begs the question: why > > bother? > > It could be the quickest way to express the intent. Ok, I challenge you to find me anything for which the quickest way to express the intent is a 50 *thousand* symbol regular expression. :-) > > The answer to that question, in my mind, shows the difference between > > a clever programmer and a pragmatic engineer. > > I think those two can overlap. :-) Indeed they can. But I have grave doubts that this is a good example of such overlap. > > I submit that it's time to reach for another tool well before you get > > to an RE that big > > Why, if the grammar is type three in Chomsky's hierarchy, i.e. a regular > grammar? I think sticking with code aimed at regular grammars, or more > likely regexps, will do better than, say, a parser generator for a > type-two context-free grammar. Is there an extant, non-theoretical, example of such a grammar? > As well as the lex(1) family, there's Ragel as another example. > http://www.colm.net/open-source/ragel/ This is moving the goal posts more than a bit. I'm suggesting that a 50k-symbol RE is unlikely to be the best solution to any reasonable problem. A state-machine generator, even one with 50k statements, is not a 50k RE. > > 3. Optimizing regular expressions. You're still bound by the known > > theoretical properties of finite automata here. > > Well, we're back to the RE v. regexp again. /^[0-9]+\.jpeg$/ is matched > by some engines by first checking the last five bytes are =E2=80=98.jpeg= =E2=80=99. ...in general, in order to find the end, won't _something_ have to traverse the entire input? (Note that I said, "in general". Allusions to mmap'ed files or seeking to the end of a file are not general, since they don't apply well to important categories of input sources, such as pipes or network connections.) > $ debugperl -Dr -e \ > > '"123546789012354678901235467890123546789012.jpg" =3D~ /^[0-9]+= \.jpeg$/' > ... > Matching REx "^[0-9]+\.jpeg$" against "123546789012354678901235467890= 123546789012.jpg" > Intuit: trying to determine minimum start position... > doing 'check' fbm scan, [1..46] gave -1 > Did not find floating substr ".jpeg"$... > Match rejected by optimizer > Freeing REx: "^[0-9]+\.jpeg$" > $ > > Boyer-Moore string searching can be used. Common-subregexp-elimination > can spot repetitive fragment of regexp and factor them into a single set > of states along with pairing the route into them with the appropriate > route out. Well, that's what big-O notation accounts for. I'm afraid none of this really changes the time bounds, however, when applied in general. > The more regexp engines are optimised, the more benefit to the > programmer from sticking to a regexp rather than, say, ad hoc parsing. This is comparing apples and oranges. There may be all sorts of heuristics that we can apply to specific regular expressions to prune the search space, and that's great. But by their very nature, heuristics are not always generally applicable. As an analogy, we know that we cannot solve _the_ halting problem, but we also know that we can solve _many_ halting problem_s_. For example, a compiler can recognize that any of, `for(;;);` or `while(1);` or `loop {}` do not halt, and so on, ad nauseum, but even if some oracle can recognize arbitrarily many such halting problems, we still haven't solved the general problem. > The theory of REs is interesting and important, but regexps deviate from > it ever more. Yup. My post should not be construed as suggesting that regexps are not useful, or that they should not be a part of a programmer's toolkit. My post _should_ be construed as a suggestion that they are not always the best solution, and a part of being an engineer is finding that dividing line. - Dan C.