From mboxrd@z Thu Jan 1 00:00:00 1970 X-Msuck: nntp://news.gmane.io/gmane.comp.tex.context/109995 Path: news.gmane.io!.POSTED.blaine.gmane.org!not-for-mail From: luigi scarso Newsgroups: gmane.comp.tex.context Subject: Re: Off-topic: Struggles with LPEG grammar Date: Mon, 21 Dec 2020 14:47:05 +0100 Message-ID: References: <364BD99F-0BC4-4CAB-84AD-AED668455854@elvenkind.com> <2FDD073C-DC12-4235-BBDC-0A5F4EE6B1F7@elvenkind.com> Reply-To: mailing list for ConTeXt users Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="===============0577500067460226195==" Injection-Info: ciao.gmane.io; posting-host="blaine.gmane.org:116.202.254.214"; logging-data="37615"; mail-complaints-to="usenet@ciao.gmane.io" To: mailing list for ConTeXt users Original-X-From: ntg-context-bounces@ntg.nl Mon Dec 21 14:48:03 2020 Return-path: Envelope-to: gctc-ntg-context-518@m.gmane-mx.org Original-Received: from zapf.boekplan.nl ([5.39.185.232] helo=zapf.ntg.nl) by ciao.gmane.io with esmtps (TLS1.3:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.92) (envelope-from ) id 1krLXe-0009fE-R3 for gctc-ntg-context-518@m.gmane-mx.org; Mon, 21 Dec 2020 14:48:02 +0100 Original-Received: from localhost (localhost [127.0.0.1]) by zapf.ntg.nl (Postfix) with ESMTP id D6A801C154A; Mon, 21 Dec 2020 14:47:25 +0100 (CET) X-Virus-Scanned: Debian amavisd-new at zapf.boekplan.nl Original-Received: from zapf.ntg.nl ([127.0.0.1]) by localhost (zapf.ntg.nl [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id qZTp2Bj6GgoZ; Mon, 21 Dec 2020 14:47:21 +0100 (CET) Original-Received: from zapf.ntg.nl (localhost [127.0.0.1]) by zapf.ntg.nl (Postfix) with ESMTP id B8B241C1556; Mon, 21 Dec 2020 14:47:21 +0100 (CET) Original-Received: from localhost (localhost [127.0.0.1]) by zapf.ntg.nl (Postfix) with ESMTP id 45EFE1C1556 for ; Mon, 21 Dec 2020 14:47:20 +0100 (CET) X-Virus-Scanned: Debian amavisd-new at zapf.boekplan.nl Original-Received: from zapf.ntg.nl ([127.0.0.1]) by localhost (zapf.ntg.nl [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id HG8VFJigWNRB for ; Mon, 21 Dec 2020 14:47:17 +0100 (CET) Received-SPF: Pass (mailfrom) identity=mailfrom; client-ip=209.85.167.41; helo=mail-lf1-f41.google.com; envelope-from=luigi.scarso@gmail.com; receiver= Original-Received: from mail-lf1-f41.google.com (mail-lf1-f41.google.com [209.85.167.41]) (using TLSv1.3 with cipher TLS_AES_128_GCM_SHA256 (128/128 bits)) (No client certificate requested) by zapf.ntg.nl (Postfix) with ESMTPS id B6B4B1C154A for ; Mon, 21 Dec 2020 14:47:17 +0100 (CET) Original-Received: by mail-lf1-f41.google.com with SMTP id 23so23723688lfg.10 for ; Mon, 21 Dec 2020 05:47:17 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to; bh=BWMnHeCM+NrxbQFP9C9r0hlK6IZNPE9eM+g66yVAD88=; b=l4Awnb/OJFxob4BBHRAMyTvvpj33L8+R63//OdCPrTnCyr253KcQOwhucASSiaTQgP Jwlwc6Hox8oKC4AGhufDR4CpHz9uVz1tlP7sDbJUd0cYNrFy7qfTCiGQ0g+siH0Pn03A KkUIJ3pMfLpYXcKIt1PiQv5Aac+HySC/+4N4iLbMxJkZGZbova4w3oonVjebQn+eZS5J D7XpmjI2wKyd6EQ5n1z/tsJjdh5J4vQP7x0zLLVCauCz4z03CdHJWasCFkzGoN5zUjQW RZXoF2m8LocZp9OfR92ZTDdGHO4aH2uGkMi0hK3gKTP0hEsxbs9HC3RMB20W8aJhD6JS GOxg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to; bh=BWMnHeCM+NrxbQFP9C9r0hlK6IZNPE9eM+g66yVAD88=; b=MbW28IvZn3Zs+f7Dp/Fk1wJpP6wcOXSGNyIdzU12fwBBVPGBltnS3tSAtyVSpN4jMK XaW8BovAZvL7z0OC7vKw41d+Ww9HsJtEBDeTypf358FlQtNAVZFM46vy4PV6RT9kTSvG 9lxqxHPlJTgKwotJZ9hKU0iYwZ2uiXsoPUoAl0PJZNDfByfN1/vmsRmxLRKrhhhxCLWw 5gCobUic7jF8AXeloHEm71bk/+q5aTa6z6aj4srADVXaubdHjzc40AqtblVLxzVWiPxC 7ZSOWPSYIFgiqK/n5wZN9EJvU5X/iIlR5OKUss/pOeojHHZgnyb9PfvQASmzY5oeqor6 NJ0Q== X-Gm-Message-State: AOAM532uRMbkRW5OgKrhBagczFU3Cto2dAnxDg+vmWQ6BsaV5s6Xql1T s3UrMouNPLHW63xw1hmsfvkiNt5yEqZo47guE/Njhf8nqtVf6w== X-Google-Smtp-Source: ABdhPJzr+cns6SqwCNiCprf1Ja7zJJ7ssqH2NNngZA8BGrzJEt73IPnquVH0gpAIbxmyBKIuORN/e3zKUhp1y87jgAU= X-Received: by 2002:a2e:86d4:: with SMTP id n20mr7212178ljj.486.1608558436909; Mon, 21 Dec 2020 05:47:16 -0800 (PST) In-Reply-To: <2FDD073C-DC12-4235-BBDC-0A5F4EE6B1F7@elvenkind.com> X-BeenThere: ntg-context@ntg.nl X-Mailman-Version: 2.1.26 Precedence: list List-Id: mailing list for ConTeXt users List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: ntg-context-bounces@ntg.nl Original-Sender: "ntg-context" Xref: news.gmane.io gmane.comp.tex.context:109995 Archived-At: --===============0577500067460226195== Content-Type: multipart/alternative; boundary="0000000000003b856605b6f9b210" --0000000000003b856605b6f9b210 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable On Mon, Dec 21, 2020 at 2:36 PM Taco Hoekwater wrote: > > > > On 21 Dec 2020, at 14:08, Mojca Miklavec > wrote: > > > > Dear Taco, > > > > On Mon, 21 Dec 2020 at 13:46, Taco Hoekwater wrote: > >>> On 21 Dec 2020, at 13:16, Mojca Miklavec wrote: > >>> > >>> My only explanation would be that perhaps "^1" is so greedy that the > >>> rest of the pattern doesn't get found. But I don't want to believe > >>> that explanation. > >> > >> Which (of course) means that that is exactly what happens ;) > >> > >> The ones that match are > >> > >> ababbb (a (ba+bb) b) =3D> r4 r1(r3(r5 r4) r2(r5 r5)) r5 > >> abbbab (a (bb+ba) b) =3D> r4 r1(r2(r5 r5) r3(r5 r4)) r5 > >> > >> With the ^1, in the =E2=80=9Cbb=E2=80=9D cases the first =E2=80=9Cb=E2= =80=9D eats all three =E2=80=9Cb=E2=80=9Ds: > >> > >> ababbb fails the r5 at the end > >> > >> abbbab fails the first r2 already (since the second r5 therein never > happens) > > > > Is this a deliberate choice, a limitation of the grammar > > expressiveness, some misuse on my side that could/should/needs to be > > implemented in a different way, or does it count as a "bug" on the > > lpeg side? > > > > For example, I wouldn't expect a regexp "b+b" to fail on "bbb" just > > because "b+" would eat all three "b"s at once (the regexp "b+b" in > > fact finds "bbb", and I would expect a less-than-totally-greedy hit > > with lpeg as well). Or is my reasoning wrong here? > > PEGs are greedy by design, which is a consequence of the fact that PEGS d= o > not backtrack, which goes back to the underlying assumptive rule of PEGs > that there is one (and only one!) =E2=80=98correct=E2=80=99 way to parse = the input. > Allowing backtracking destroys that assumption and by doing so would > complicate the system to a level that would make it comparable to PCRE > (with all the associated penalties on processing speed and a much greater > codebase). > greedy vs non-greedy is one of the things that I always keep in mind when I start with lpeg, and regularly I fail to apply -- because I think in the "perl regex way". Anyway, http://www.gammon.com.au/lpeg has some good lines: e.g. this one (from the lpeg site) find the pattern anywhere in the line: function anywhere (p) return lpeg.P { p + 1 * lpeg.V(1) } end print (lpeg.match (anywhere ("dog"), target)) --=20 luigi --0000000000003b856605b6f9b210 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable


=
On Mon, Dec 21, 2020 at 2:36 PM Taco = Hoekwater <taco@elvenkind.com&= gt; wrote:


> On 21 Dec 2020, at 14:08, Mojca Miklavec <mojca.miklavec.lists@gmail.com> wrote:
>
> Dear Taco,
>
> On Mon, 21 Dec 2020 at 13:46, Taco Hoekwater wrote:
>>> On 21 Dec 2020, at 13:16, Mojca Miklavec wrote:
>>>
>>> My only explanation would be that perhaps "^1" is so= greedy that the
>>> rest of the pattern doesn't get found. But I don't wan= t to believe
>>> that explanation.
>>
>> Which (of course) means that that is exactly what happens ;)
>>
>> The ones that match are
>>
>> ababbb (a (ba+bb) b) =3D> r4 r1(r3(r5 r4) r2(r5 r5)) r5
>> abbbab (a (bb+ba) b) =3D> r4 r1(r2(r5 r5) r3(r5 r4)) r5
>>
>> With the ^1, in the =E2=80=9Cbb=E2=80=9D cases the first =E2=80=9C= b=E2=80=9D eats all three =E2=80=9Cb=E2=80=9Ds:
>>
>> ababbb fails the r5 at the end
>>
>> abbbab fails the first r2 already (since the second r5 therein nev= er happens)
>
> Is this a deliberate choice, a limitation of the grammar
> expressiveness, some misuse on my side that could/should/needs to be > implemented in a different way, or does it count as a "bug" = on the
> lpeg side?
>
> For example, I wouldn't expect a regexp "b+b" to fail on= "bbb" just
> because "b+" would eat all three "b"s at once (the= regexp "b+b" in
> fact finds "bbb", and I would expect a less-than-totally-gre= edy hit
> with lpeg as well). Or is my reasoning wrong here?

PEGs are greedy by design, which is a consequence of the fact that PEGS do = not backtrack, which goes back to the underlying assumptive rule of PEGs th= at there is one (and only one!) =E2=80=98correct=E2=80=99 way to parse the = input. Allowing backtracking destroys that assumption and by doing so would= complicate the system to a level that would make it comparable to PCRE (wi= th all the associated penalties on processing speed and a much greater code= base).

has some good=C2=A0= lines:
e.g. this one (from the lpeg site) find the pattern anywhe= re in the line:

function anywhere (p)
=C2= =A0 return lpeg.P { p + 1 * lpeg.V(1) }
end
print (lpeg.match (anywhe= re ("dog"), target))

--
luigi
--0000000000003b856605b6f9b210-- --===============0577500067460226195== Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: base64 Content-Disposition: inline X19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19f X19fX19fX19fX19fX19fX19fX19fX19fX18KSWYgeW91ciBxdWVzdGlvbiBpcyBvZiBpbnRlcmVz dCB0byBvdGhlcnMgYXMgd2VsbCwgcGxlYXNlIGFkZCBhbiBlbnRyeSB0byB0aGUgV2lraSEKCm1h aWxsaXN0IDogbnRnLWNvbnRleHRAbnRnLm5sIC8gaHR0cDovL3d3dy5udGcubmwvbWFpbG1hbi9s aXN0aW5mby9udGctY29udGV4dAp3ZWJwYWdlICA6IGh0dHA6Ly93d3cucHJhZ21hLWFkZS5ubCAv IGh0dHA6Ly9jb250ZXh0LmFhbmhldC5uZXQKYXJjaGl2ZSAgOiBodHRwczovL2JpdGJ1Y2tldC5v cmcvcGhnL2NvbnRleHQtbWlycm9yL2NvbW1pdHMvCndpa2kgICAgIDogaHR0cDovL2NvbnRleHRn YXJkZW4ubmV0Cl9fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19f X19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fCg== --===============0577500067460226195==--