Re: Off-topic: Struggles with LPEG grammar

From: Taco Hoekwater <taco@elvenkind.com>
To: mailing list for ConTeXt users <ntg-context@ntg.nl>
Subject: Re: Off-topic: Struggles with LPEG grammar
Date: Mon, 21 Dec 2020 14:36:35 +0100	[thread overview]
Message-ID: <2FDD073C-DC12-4235-BBDC-0A5F4EE6B1F7@elvenkind.com> (raw)
In-Reply-To: <CALBOmsYx45L+SQLbFz3gW6MxMQa4mu-gJWbdL2iMXWbRkpA7mw@mail.gmail.com>

> On 21 Dec 2020, at 14:08, Mojca Miklavec <mojca.miklavec.lists@gmail.com> wrote:
> 
> Dear Taco,
> 
> On Mon, 21 Dec 2020 at 13:46, Taco Hoekwater wrote:
>>> On 21 Dec 2020, at 13:16, Mojca Miklavec wrote:
>>> 
>>> My only explanation would be that perhaps "^1" is so greedy that the
>>> rest of the pattern doesn't get found. But I don't want to believe
>>> that explanation.
>> 
>> Which (of course) means that that is exactly what happens ;)
>> 
>> The ones that match are
>> 
>> ababbb (a (ba+bb) b) => r4 r1(r3(r5 r4) r2(r5 r5)) r5
>> abbbab (a (bb+ba) b) => r4 r1(r2(r5 r5) r3(r5 r4)) r5
>> 
>> With the ^1, in the “bb” cases the first “b” eats all three “b”s:
>> 
>> ababbb fails the r5 at the end
>> 
>> abbbab fails the first r2 already (since the second r5 therein never happens)
> 
> Is this a deliberate choice, a limitation of the grammar
> expressiveness, some misuse on my side that could/should/needs to be
> implemented in a different way, or does it count as a "bug" on the
> lpeg side?
> 
> For example, I wouldn't expect a regexp "b+b" to fail on "bbb" just
> because "b+" would eat all three "b"s at once (the regexp "b+b" in
> fact finds "bbb", and I would expect a less-than-totally-greedy hit
> with lpeg as well). Or is my reasoning wrong here?

PEGs are greedy by design, which is a consequence of the fact that PEGS do not backtrack, which goes back to the underlying assumptive rule of PEGs that there is one (and only one!) ‘correct’ way to parse the input. Allowing backtracking destroys that assumption and by doing so would complicate the system to a level that would make it comparable to PCRE (with all the associated penalties on processing speed and a much greater codebase).

Another way of thinking of it (perhaps that helps): PEGs are _declarative_ on the input, whereas REs are _interpretive_.

Yet another way of thinking about it: PEGs rigorously define the (programming) language of the input.

It takes a bit of rewiring your brain to get comfortable with PEGs when you are used to REs, and unpredictable input is much harder to tackle in PEGs. OTOH, using PEGs are much better if there is an explicit grammar.

In your example, the fact that you even considered ^1 means that you were still thinking too much in terms of regular expressions, but I know it is very hard to learn how to do something that is _only a little_ different from something you know well already.

Best wishes,
Taco

___________________________________________________________________________________
If your question is of interest to others as well, please add an entry to the Wiki!

maillist : ntg-context@ntg.nl / http://www.ntg.nl/mailman/listinfo/ntg-context
webpage  : http://www.pragma-ade.nl / http://context.aanhet.net
archive  : https://bitbucket.org/phg/context-mirror/commits/
wiki     : http://contextgarden.net
___________________________________________________________________________________