[Caml-list] Menhir grammar with sequences delimited by same token

caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed

* [Caml-list] Menhir grammar with sequences delimited by same token
@ 2016-05-08  9:33 Dario Teixeira
  2016-05-08 10:27 ` Jacques-Henri Jourdan
  2016-05-08 13:35 ` Allan Wegan
  0 siblings, 2 replies; 8+ messages in thread
From: Dario Teixeira @ 2016-05-08  9:33 UTC (permalink / raw)
  To: caml-list

Hi,

(Sending this to Caml-list because Menhir-list is currently down.)

I've come across an interesting parsing problem, one for which I
wonder if there is a succinct solution in Menhir.  Suppose I want
to parse a markup which uses the same token for delimiting *both*
the beginning and the termination of a bold sequence (and likewise
for an emph sequence).  Basically this:

   inline:
     | TEXT               {Ast.Text $1}
     | BOLD inline* BOLD  {Ast.Bold $2}
     | EMPH inline* EMPH  {Ast.Emph $2}

Which of course has a shift/reduce conflict: if the token stream is
[BOLD; TEXT; BOLD; ...], what should the parser do upon encountering
the second BOLD -- start a new nesting level, or close the current
one?  I can force the latter behaviour by rearranging the grammar
so that an inline sequence within BOLDs cannot contain BOLD itself,
and likewise for EMPH:

   inline:
     | TEXT                        {Ast.Text $1}
     | BOLD inline_sans_bold* BOLD {Ast.Bold $2}
     | EMPH inline_sans_emph* EMPH {Ast.Emph $2}

   inline_sans_bold:
     | TEXT                        {Ast.Text $1}
     | EMPH inline_sans_emph* EMPH {Ast.Emph $2}

   inline_sans_emph:
     | TEXT                        {Ast.Text $1}
     | BOLD inline_sans_bold* BOLD {Ast.Bold $2}

For this simple example this approach is feasible, but blows up
into silliness for a real-world case where besides BOLD and EMPH I
have many other similar tokens.  Does Menhir offer a more succinct
solution to this problem?  (I reckon using the priority mechanism
somehow, but exactly how eludes me.)

Thanks in advance for your time!
Best regards,
Dario Teixeira

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [Caml-list] Menhir grammar with sequences delimited by same token
  2016-05-08  9:33 [Caml-list] Menhir grammar with sequences delimited by same token Dario Teixeira
@ 2016-05-08 10:27 ` Jacques-Henri Jourdan
  2016-05-08 11:57   ` Sébastien Hinderer
  2016-05-08 13:43   ` Dario Teixeira
  2016-05-08 13:35 ` Allan Wegan
  1 sibling, 2 replies; 8+ messages in thread
From: Jacques-Henri Jourdan @ 2016-05-08 10:27 UTC (permalink / raw)
  To: caml-list


[-- Attachment #1.1: Type: text/plain, Size: 2365 bytes --]

Hi,


The priority mechanism won't help, because the LR1 automaton does not
contain the necessary information for deciding whether to shift or reduce.

The solution might be to use parameterized non-terminals, but all the
non-exponential solutions I can think of do not pass the termination
checker for parameterized non-terminals.

I have, however, another proposition : if you allow markup areas not be
well nested, then you can simply have an environment recording, for each
style, whether it is currently in use or not.


-- 

JH


Le 08/05/2016 à 11:33, Dario Teixeira a écrit :
> Hi,
>
> (Sending this to Caml-list because Menhir-list is currently down.)
>
> I've come across an interesting parsing problem, one for which I
> wonder if there is a succinct solution in Menhir.  Suppose I want
> to parse a markup which uses the same token for delimiting *both*
> the beginning and the termination of a bold sequence (and likewise
> for an emph sequence).  Basically this:
>
>   inline:
>     | TEXT               {Ast.Text $1}
>     | BOLD inline* BOLD  {Ast.Bold $2}
>     | EMPH inline* EMPH  {Ast.Emph $2}
>
>
> Which of course has a shift/reduce conflict: if the token stream is
> [BOLD; TEXT; BOLD; ...], what should the parser do upon encountering
> the second BOLD -- start a new nesting level, or close the current
> one?  I can force the latter behaviour by rearranging the grammar
> so that an inline sequence within BOLDs cannot contain BOLD itself,
> and likewise for EMPH:
>
>   inline:
>     | TEXT                        {Ast.Text $1}
>     | BOLD inline_sans_bold* BOLD {Ast.Bold $2}
>     | EMPH inline_sans_emph* EMPH {Ast.Emph $2}
>
>   inline_sans_bold:
>     | TEXT                        {Ast.Text $1}
>     | EMPH inline_sans_emph* EMPH {Ast.Emph $2}
>
>   inline_sans_emph:
>     | TEXT                        {Ast.Text $1}
>     | BOLD inline_sans_bold* BOLD {Ast.Bold $2}
>
>
> For this simple example this approach is feasible, but blows up
> into silliness for a real-world case where besides BOLD and EMPH I
> have many other similar tokens.  Does Menhir offer a more succinct
> solution to this problem?  (I reckon using the priority mechanism
> somehow, but exactly how eludes me.)
>
> Thanks in advance for your time!
> Best regards,
> Dario Teixeira
>
>



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [Caml-list] Menhir grammar with sequences delimited by same token
  2016-05-08 10:27 ` Jacques-Henri Jourdan
@ 2016-05-08 11:57   ` Sébastien Hinderer
  2016-05-08 14:16     ` Dario Teixeira
  2016-05-08 13:43   ` Dario Teixeira
  1 sibling, 1 reply; 8+ messages in thread
From: Sébastien Hinderer @ 2016-05-08 11:57 UTC (permalink / raw)
  To: caml-list

Hi,

Jacques-Henri Jourdan (2016/05/08 12:27 +0200):
> Hi,
> 
> 
> The priority mechanism won't help, because the LR1 automaton does not
> contain the necessary information for deciding whether to shift or reduce.
> 
> The solution might be to use parameterized non-terminals, but all the
> non-exponential solutions I can think of do not pass the termination
> checker for parameterized non-terminals.
> 
> I have, however, another proposition : if you allow markup areas not be
> well nested, then you can simply have an environment recording, for each
> style, whether it is currently in use or not.

Jacques-Henri (or anybody else) please correct me if I am wrong.

Sometimes one way of dealing with such problems is to recognize (at the
syntactic stage) a
language that is a bit bigger than the desired one and then to check, at
the semantic level, that the real constraints are satisfied.

Sébastien.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [Caml-list] Menhir grammar with sequences delimited by same token
  2016-05-08  9:33 [Caml-list] Menhir grammar with sequences delimited by same token Dario Teixeira
  2016-05-08 10:27 ` Jacques-Henri Jourdan
@ 2016-05-08 13:35 ` Allan Wegan
  2016-05-08 14:19   ` Dario Teixeira
  1 sibling, 1 reply; 8+ messages in thread
From: Allan Wegan @ 2016-05-08 13:35 UTC (permalink / raw)
  To: caml-list

[-- Attachment #1: Type: text/plain, Size: 797 bytes --]

> Which of course has a shift/reduce conflict: if the token stream is
> [BOLD; TEXT; BOLD; ...], what should the parser do upon encountering
> the second BOLD -- start a new nesting level, or close the current
> one?

I am only a beginner at formal language theory - so might have missed
important details.
But it seems to me that you may be trying to parse a context-sensitive
language with a parser for context-free languages. Changing the parser
may help.



-- 
Allan Wegan
<http://www.allanwegan.de/>
Jabber: allanwegan@ffnord.net
 OTR-Fingerprint: E4DCAA40 4859428E B3912896 F2498604 8CAA126F
Jabber: allanwegan@jabber.ccc.de
 OTR-Fingerprint: A1AAA1B9 C067F988 4A424D33 98343469 29164587
ICQ: 209459114
 OTR-Fingerprint: 71DE5B5E 67D6D758 A93BF1CE 7DA06625 205AC6EC


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [Caml-list] Menhir grammar with sequences delimited by same token
  2016-05-08 10:27 ` Jacques-Henri Jourdan
  2016-05-08 11:57   ` Sébastien Hinderer
@ 2016-05-08 13:43   ` Dario Teixeira
  2016-05-08 21:29     ` Jacques-Henri Jourdan
  1 sibling, 1 reply; 8+ messages in thread
From: Dario Teixeira @ 2016-05-08 13:43 UTC (permalink / raw)
  To: Jacques-Henri Jourdan; +Cc: caml-list

Hi,

And thanks for the prompt reply!

> The priority mechanism won't help, because the LR1 automaton does not
> contain the necessary information for deciding whether to shift or 
> reduce.

Alas...

> The solution might be to use parameterized non-terminals, but all the
> non-exponential solutions I can think of do not pass the termination
> checker for parameterized non-terminals.

Hmm, I'll have to take a closer look at that feature of Menhir.

> I have, however, another proposition : if you allow markup areas not be
> well nested, then you can simply have an environment recording, for 
> each
> style, whether it is currently in use or not.

In the old version, the top-most parsing layer (generated via Menhir) 
would
only see tokens such as BEGIN_BOLD/END_BOLD.  There was an intermediate
layer between the lexer and the parser which had a simple state machine
that translated raw BOLD tokens into the BEGIN_BOLD/END_BOLD tokens.
I'm now trying to minimise the "magic" in the intermediate layer, which
is why I wondered if there was an elegant pure Menhir solution.

Best regards,
Dario Teixeira

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [Caml-list] Menhir grammar with sequences delimited by same token
  2016-05-08 11:57   ` Sébastien Hinderer
@ 2016-05-08 14:16     ` Dario Teixeira
  0 siblings, 0 replies; 8+ messages in thread
From: Dario Teixeira @ 2016-05-08 14:16 UTC (permalink / raw)
  To: caml-list

Hi,

> Jacques-Henri (or anybody else) please correct me if I am wrong.
> 
> Sometimes one way of dealing with such problems is to recognize (at the
> syntactic stage) a
> language that is a bit bigger than the desired one and then to check, 
> at
> the semantic level, that the real constraints are satisfied.

You make a good point, but in this case the reverse actually applies.
Trees such as "[Bold [Bold Text]]" are valid according to the AST,
and are in fact possible with other markups.

Such trees may be nonsensical in practice, but are also harmless,
and the "AST sanitiser" (a component that checks for semantic
errors in the AST) does not presently complain about them.

The longer revised grammar (the one with inline_sans_bold et al)
accepts only a *subset* of the valid ASTs, because it excludes
such improbable (but valid!) derivations.  Therefore, this is
a case where the parser deliberately recognises a language that
is *smaller* than that accepted by the AST.

Best regards,
Dario Teixeira

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [Caml-list] Menhir grammar with sequences delimited by same token
  2016-05-08 13:35 ` Allan Wegan
@ 2016-05-08 14:19   ` Dario Teixeira
  0 siblings, 0 replies; 8+ messages in thread
From: Dario Teixeira @ 2016-05-08 14:19 UTC (permalink / raw)
  To: Allan Wegan; +Cc: caml-list

Hi,

> I am only a beginner at formal language theory - so might have missed
> important details.
> But it seems to me that you may be trying to parse a context-sensitive
> language with a parser for context-free languages. Changing the parser
> may help.

Any Chomsky hierarchy experts in the audience?... ;-)

Best regards,
Dario Teixeira


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [Caml-list] Menhir grammar with sequences delimited by same token
  2016-05-08 13:43   ` Dario Teixeira
@ 2016-05-08 21:29     ` Jacques-Henri Jourdan
  0 siblings, 0 replies; 8+ messages in thread
From: Jacques-Henri Jourdan @ 2016-05-08 21:29 UTC (permalink / raw)
  To: Dario Teixeira; +Cc: caml-list


[-- Attachment #1.1: Type: text/plain, Size: 1039 bytes --]


>> I have, however, another proposition : if you allow markup areas not be
>> well nested, then you can simply have an environment recording, for each
>> style, whether it is currently in use or not.
>
> In the old version, the top-most parsing layer (generated via Menhir)
> would
> only see tokens such as BEGIN_BOLD/END_BOLD.  There was an intermediate
> layer between the lexer and the parser which had a simple state machine
> that translated raw BOLD tokens into the BEGIN_BOLD/END_BOLD tokens.
> I'm now trying to minimise the "magic" in the intermediate layer, which
> is why I wondered if there was an elegant pure Menhir solution.
>

I see. Even if you could find a way to tell menhir what you want,
AFAICT, the size of the automaton will grow exponentially with the
number of kinds of tags, so this does not seems to be a good idea.

I think that you can either do it by hand just before parsing (as you
do) or just after parsing (as Sebastien proposed).

-- 
JH

> Best regards,
> Dario Teixeira



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2016-05-08 21:29 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-05-08  9:33 [Caml-list] Menhir grammar with sequences delimited by same token Dario Teixeira
2016-05-08 10:27 ` Jacques-Henri Jourdan
2016-05-08 11:57   ` Sébastien Hinderer
2016-05-08 14:16     ` Dario Teixeira
2016-05-08 13:43   ` Dario Teixeira
2016-05-08 21:29     ` Jacques-Henri Jourdan
2016-05-08 13:35 ` Allan Wegan
2016-05-08 14:19   ` Dario Teixeira

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).