From mboxrd@z Thu Jan 1 00:00:00 1970 X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on inbox.vuxu.org X-Spam-Level: X-Spam-Status: No, score=-0.8 required=5.0 tests=DKIM_ADSP_CUSTOM_MED, DKIM_INVALID,DKIM_SIGNED,FREEMAIL_FROM,MAILING_LIST_MULTI autolearn=ham autolearn_force=no version=3.4.4 Received: (qmail 17495 invoked from network); 3 Mar 2023 19:31:59 -0000 Received: from minnie.tuhs.org (2600:3c01:e000:146::1) by inbox.vuxu.org with ESMTPUTF8; 3 Mar 2023 19:31:59 -0000 Received: from minnie.tuhs.org (localhost [IPv6:::1]) by minnie.tuhs.org (Postfix) with ESMTP id 76191434A7; Sat, 4 Mar 2023 05:31:58 +1000 (AEST) Received: from mail-lj1-x22d.google.com (mail-lj1-x22d.google.com [IPv6:2a00:1450:4864:20::22d]) by minnie.tuhs.org (Postfix) with ESMTPS id BB0E5434A3 for ; Sat, 4 Mar 2023 05:31:52 +1000 (AEST) Received: by mail-lj1-x22d.google.com with SMTP id x6so3462643ljq.1 for ; Fri, 03 Mar 2023 11:31:52 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=QULimopC3ysQpdSvvWlpNJMVGzTSaWzUzD/BjbIeL3Y=; b=fuT9WdiVgqEJWOc3dBqn4GbBce2amb8CzPRbXVACiGp84C/yLim5VMy/JQ4MgSoKXu /f6+tmh4qapLuA1rE5P9wFBXBBczXyGIfJQpRpvFrGn2ksSkhsqihAzyizet8cGapOWN leVcc7qpl/mOzt8o6bS4ren1O66Ev025oTspst0K7NPuwDP/sChJwaW1lc9fNmZvtT73 WXKboY7He35ofpHUQxkammANlpMoX6Xm2IYpZDgni56Ji9cNawlpM0K7YkAcpGzyP0H+ 3dhrHfWWo97ye/505//WzSKxufsDQDq5EtTZXMflantXuEhA4vyiyexkwQjb+Bz7TIXV w2wA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=QULimopC3ysQpdSvvWlpNJMVGzTSaWzUzD/BjbIeL3Y=; b=ilTb8/KWsacRljO+zQiF0MX6Zqyu5C6AR2aIRM4DoDhR0InyC1ZJ2h6zeaBK4YLvnT VRoh36uxrM4jpSkKZZ+vsnaPPWWyvJvk/DciS5ki5GQHdDKNWa1p44TXQr0LkEiTbEBY ePmgvMPZwZ6HgOA+MGEeK5ZcbxCzfOjDm72Aj66TXV3yAOItriIPC6+f5joRTzC7Ds5p QrYYFkBjthFtZG7ZKHlNqbnoBmDDR7gpwVFaIfavkZWdWWhe41V/I9rIJ6StP0nrHy1O tHiS+QMytABms/LE7jCJyHsuTFE+NpQ+o03YlY2ym0vs0NZY4EHgBRB2sCFzMyy6vdnh AF/Q== X-Gm-Message-State: AO0yUKW37x+RD3wroxaeLgeNu1V2TYtwgbZFigoUp1+B9KlQNyH/ViSd +ald9kPDyiJc0hJl9T9EaBMNdSuycplaPTL+jfI= X-Google-Smtp-Source: AK7set+FO6im/ABIiqThLgMYW1F3rN5VGQB5wEnpZZI6kcbXiRirwrbvowoVbWxLoKeczL+Ypw43aFAc9lai00mWLRk= X-Received: by 2002:a05:651c:2322:b0:295:93eb:bab1 with SMTP id bi34-20020a05651c232200b0029593ebbab1mr934994ljb.1.1677871910547; Fri, 03 Mar 2023 11:31:50 -0800 (PST) MIME-Version: 1.0 References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net> In-Reply-To: From: Dan Cross Date: Fri, 3 Mar 2023 14:31:14 -0500 Message-ID: To: Grant Taylor Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Message-ID-Hash: Z2T4FLLZZRWMY5XN56VV7ECPZIGHQDAY X-Message-ID-Hash: Z2T4FLLZZRWMY5XN56VV7ECPZIGHQDAY X-MailFrom: crossd@gmail.com X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: coff@tuhs.org X-Mailman-Version: 3.3.6b1 Precedence: list Subject: [COFF] Re: Requesting thoughts on extended regular expressions in grep. List-Id: Computer Old Farts Forum Archived-At: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: On Fri, Mar 3, 2023 at 2:06=E2=80=AFPM Grant Taylor via COFF wrote: > Thank you all for very interesting and engaging comments & threads to > chase / pull / untangle. > > I'd like to expand / refine my original question a little bit. > > On 3/2/23 11:54 AM, Grant Taylor via COFF wrote: > > I'd like some thoughts ~> input on extended regular expressions used > > with grep, specifically GNU grep -e / egrep. > > While some reading of the references that Clem provided I came across > multiple indications that back-references can be problematic from a > performance stand point. > > So I'd like to know if all back-references are problematic, or if very > specific back-references are okay. The thing about backreferences is that they're not representable in the regular languages because they require additional state (the thing the backref refers to), so you cannot construct a DFA corresponding to them, nor an NDFA simulator (this is where Freidl gets things wrong!); you really need a pushdown automata and then you're in the domain of the context-free languages. Therefore, "regexps" that use back references are not actually regular expressions. Yet, popular engines support them...but how? Well, pretty much all of them use a backtracking implementation, which _can_ be exponential in both time and space. Now, that said, there are plenty of REs, even some with backrefs, that'll execute plenty fast enough on backtracking implementations; it really depends on the expressions in question and the size of strings you're trying to match against. But you lose the bounding guarantees DFAs and NDFAs provide. > Suppose I have the following two lines: > > aaa aaa > aaa bbb > > Does the following RE w/ back-reference introduce a big performance penal= ty? > > (aaa|bbb) \1 > > As in: > > % echo "aaa aaa" | egrep "(aaa|bbb) \1" > aaa aaa > > I can easily see how a back reference to something that is not a fixed > length can become a rabbit hole. But I'm wondering if a back-reference > to -- what I think is called -- an alternation (with length fixed in the > RE) is a performance hit or not. Well, it's more about the implementation strategy than the specific expression here. Could this become exponential? I don't think this one would, no; but others may, particularly if you use Kleene closures in the alternation. This _is_ something that appears in the wild, by the way, not just in theory; I did a change to Google's spelling service code to replace PCRE with re2 precisely because it was blowing up with exponential memory usage on some user input. The problems went away, but I had to rewrite a bunch of the REs involved. - Dan C.