The lexer hack

caml-list - the Caml user's mailing list

* The lexer hack
@ 2009-11-10 14:42 Dario Teixeira
  2009-11-10 17:35 ` [Caml-list] " David Allsopp
  2009-11-14 16:20 ` Micha
  0 siblings, 2 replies; 7+ messages in thread
From: Dario Teixeira @ 2009-11-10 14:42 UTC
  To: caml-list

Hi,

I'm creating a parser for a LaTeX-ish language that features verbatim blocks.
To handle them I want to switch lexers on-the-fly, depending on the parsing
context.  Therefore, I need the state from the (Menhir generated) parser
to influence the lexing process (I believe this is called the "lexer hack"
in compiler lore).

Presently I am doing this by placing a module between the lexer and the
parser, listening in on the flow of tokens, and using a crude state machine
to figure out the parsing context.  This solution is, however, error-prone
and a bit wasteful, since I'm reimplementing by hand stuff that should be
the sole competence of the parser generator.

Anyway, since I'm sure this problem pops up often, does someone have any
alternative suggestions?  I would preferably keep Menhir, but I'll switch
if some other generator offers a better approach(*).

Thanks + best regards,
Dario Teixeira

(*) I've looked into Dypgen, and its partial actions may offer a way out.
    Does someone have any experience with those and with real-world usage
    of Dypgen in general?  (In other words, is it stable enough for
    production use?)


* RE: [Caml-list] The lexer hack
  2009-11-10 14:42 The lexer hack Dario Teixeira
@ 2009-11-10 17:35 ` David Allsopp
  2009-11-10 20:02   ` Dario Teixeira
  2009-11-14 16:20 ` Micha
  1 sibling, 1 reply; 7+ messages in thread
From: David Allsopp @ 2009-11-10 17:35 UTC
  To: 'Dario Teixeira', caml-list

> I'm creating a parser for a LaTeX-ish language that features verbatim blocks.

Out of interest, how LaTeX-ish do you mean? I would hazard a guess that it's impossible to parse an unrestricted TeX file using an LR grammar (or at least no more clear than a hand-coded automaton) because you have to execute the macro expander in order to parse the file *completely* correctly. However, if you only mean LaTeX-ish in syntax (i.e. the files aren't actually TeX files) then you don't have to worry about TeX's elegant (by which I mean terrifying) \catcode mechanism and macro language!

David


* RE: [Caml-list] The lexer hack
  2009-11-10 17:35 ` [Caml-list] " David Allsopp
@ 2009-11-10 20:02   ` Dario Teixeira
  0 siblings, 0 replies; 7+ messages in thread
From: Dario Teixeira @ 2009-11-10 20:02 UTC
  To: caml-list, David Allsopp

Hi,

> Out of interest, how LaTeX-ish do you mean? I would hazard
> a guess that it's impossible to parse an unrestricted TeX
> file using an LR grammar (or at least no more clear than a
> hand-coded automaton) because you have to execute the macro
> expander in order to parse the file *completely* correctly.
> However, if you only mean LaTeX-ish in syntax (i.e. the
> files aren't actually TeX files) then you don't have to
> worry about TeX's elegant (by which I mean terrifying)
> \catcode mechanism and macro language!

I developed the language's syntax in tandem with the parser/lexer
so I made sure it was LR-friendly and Ulex-friendly (the verbatim
environments are the only parsing-unfriendly features).  The language
looks and feels like LaTeX, but without the hairy stuff...

Incidentally, the dummy token/action trick seems to be working
fine with Menhir.  Since the parser will look ahead one token,
I just have a tokenizer sitting between the lexer and the parser,
and inserting a DUMMY token into the stream after any token that
precedes a dummy action:

inline:
  | (...)
  | BEGIN_VERBATIM enter_verb DUMMY RAW exit_verb END_VERBATIM {...}
  | (...)

enter_verb: /*empty*/ {Global.context := Global.Verbatim}
exit_verb: /*empty*/  {Global.context := Global.General}

It's not the prettiest thing in the world (and I suspect I might
still find some problem with it), but as far as lexer hacks go
it's not bad and a lot better than building a state machine.
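
A minimal sketch of the mechanism described above, written in plain OCaml
so that it runs without Menhir or Ulex.  Everything in it is an assumption
made for illustration: the token type, the toy string lexer, the
`with_dummy` wrapper, and the `tokens_of` driver, which only imitates a
parser that fetches one token of lookahead before running the empty
`enter_verb` and `exit_verb` actions.  It is not the real tokenizer.

```ocaml
(* Hypothetical names throughout; only the DUMMY idea comes from the message. *)
type token =
  | BEGIN_VERBATIM | END_VERBATIM | DUMMY
  | RAW of string | TEXT of string | EOF

type context = General | Verbatim
let context = ref General            (* stands in for Global.context *)

let open_tag = "\\begin{verbatim}"
let close_tag = "\\end{verbatim}"

let starts_with s pos sub =
  let m = String.length sub in
  pos + m <= String.length s && String.sub s pos m = sub

(* Index of the first occurrence of [sub] in [s] at or after [pos]. *)
let find_from s pos sub =
  let n = String.length s in
  let rec go i =
    if i + String.length sub > n then None
    else if starts_with s i sub then Some i
    else go (i + 1)
  in
  go pos

(* Toy lexer: which rule runs depends on the context set by the parser. *)
let make_lexer (s : string) : unit -> token =
  let pos = ref 0 in
  fun () ->
    let n = String.length s in
    let upto tag = match find_from s !pos tag with Some i -> i | None -> n in
    let take stop = let t = String.sub s !pos (stop - !pos) in pos := stop; t in
    if !pos >= n then EOF
    else match !context with
      | Verbatim ->
        if starts_with s !pos close_tag
        then (pos := !pos + String.length close_tag; END_VERBATIM)
        else RAW (take (upto close_tag))
      | General ->
        if starts_with s !pos open_tag
        then (pos := !pos + String.length open_tag; BEGIN_VERBATIM)
        else TEXT (take (upto open_tag))

(* The tokenizer between lexer and parser: after BEGIN_VERBATIM it hands out
   DUMMY without touching the real lexer, so the parser's lookahead is DUMMY. *)
let with_dummy (next : unit -> token) : unit -> token =
  let pending = ref false in
  fun () ->
    if !pending then (pending := false; DUMMY)
    else begin
      let tok = next () in
      if tok = BEGIN_VERBATIM then pending := true;
      tok
    end

(* Imitates an LR(1) parser: read the lookahead first, then run the empty
   actions that the grammar places after the token just shifted. *)
let tokens_of (input : string) : token list =
  context := General;
  let next = with_dummy (make_lexer input) in
  let rec loop acc shifted =
    if shifted = EOF then List.rev acc
    else
      let la = next () in
      (match shifted, la with
       | BEGIN_VERBATIM, DUMMY -> context := Verbatim   (* enter_verb *)
       | RAW _, END_VERBATIM -> context := General      (* exit_verb *)
       | _ -> ());
      loop (la :: acc) la
  in
  let first = next () in
  loop [first] first

(* tokens_of "a\\begin{verbatim}x \\y\\end{verbatim}b" yields
   [TEXT "a"; BEGIN_VERBATIM; DUMMY; RAW "x \\y"; END_VERBATIM; TEXT "b"; EOF] *)
```

Without the DUMMY token the lookahead after BEGIN_VERBATIM would already
have been lexed in the General context, before enter_verb had a chance to
run.  Note that in this sketch the Verbatim rule must also recognise the
closing delimiter itself, because exit_verb only runs once END_VERBATIM
has been read as lookahead.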

Cheers,
Dario Teixeira


* Re: [Caml-list] The lexer hack
  2009-11-10 14:42 The lexer hack Dario Teixeira
  2009-11-10 17:35 ` [Caml-list] " David Allsopp
@ 2009-11-14 16:20 ` Micha
  2009-11-14 18:08   ` Dario Teixeira
  2009-11-14 19:09   ` Goswin von Brederlow
  1 sibling, 2 replies; 7+ messages in thread
From: Micha @ 2009-11-14 16:20 UTC
  To: caml-list; +Cc: Dario Teixeira

On Tuesday, 10. November 2009 15:42:52 Dario Teixeira wrote:
> Hi,
>
> I'm creating a parser for a LaTeX-ish language that features verbatim
> blocks. To handle them I want to switch lexers on-the-fly, depending on the
> parsing context.  Therefore, I need the state from the (Menhir generated)
> parser to influence the lexing process (I believe this is called the "lexer
> hack" in compiler lore).

If the lexer cannot decide it on the tokens seen, a packrat parser (like
Aurochs) may be a better choice, since in a PEG there is no separate lexer,
it's all one grammar, so you don't have this problem.

 Michael



* Re: [Caml-list] The lexer hack
  2009-11-14 16:20 ` Micha
@ 2009-11-14 18:08   ` Dario Teixeira
  2009-11-14 19:09   ` Goswin von Brederlow
  1 sibling, 0 replies; 7+ messages in thread
From: Dario Teixeira @ 2009-11-14 18:08 UTC
  To: caml-list, Micha

Hi,

> If the lexer cannot decide it on the tokens seen, a packrat
> parser (like Aurochs) may be a better choice, since in a PEG
> there is no separate lexer, it's all one grammar, so you don't
> have this problem.

But does Aurochs also handle UTF8 streams?

In the meantime I've implemented the parser using Ulex/Menhir
with the "dummy action" trick I mentioned before.  It allowed
me to simplify the tokenizer tremendously, though it's still
present:

https://forge.ocamlcore.org/plugins/scmsvn/viewcvs.php/trunk/lambdoc/src/lib/lambdoc_read_lambtex/?root=lambdoc

Cheers,
Dario Teixeira

* Re: [Caml-list] The lexer hack
  2009-11-14 16:20 ` Micha
  2009-11-14 18:08   ` Dario Teixeira
@ 2009-11-14 19:09   ` Goswin von Brederlow
  1 sibling, 0 replies; 7+ messages in thread
From: Goswin von Brederlow @ 2009-11-14 19:09 UTC
  To: Micha; +Cc: caml-list

Micha <micha-1@fantasymail.de> writes:

> On Tuesday, 10. November 2009 15:42:52 Dario Teixeira wrote:
>> Hi,
>>
>> I'm creating a parser for a LaTeX-ish language that features verbatim
>> blocks. To handle them I want to switch lexers on-the-fly, depending on the
>> parsing context.  Therefore, I need the state from the (Menhir generated)
>> parser to influence the lexing process (I believe this is called the "lexer
>> hack" in compiler lore).
>
> If the lexer cannot decide it on the tokens seen, a packrat parser (like
> Aurochs) may be a better choice, since in a PEG there is no separate lexer,
> it's all one grammar, so you don't have this problem.
>
>  Michael

There usually must be something present to support '(* <verbatim
text> *)' and '"<verbatim text>"', i.e. comments and strings.
Find out what is recommended for those and adapt it.
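
As a rough illustration of that approach, here is a hand-written scanner
in plain OCaml that treats a verbatim block the way lexers commonly treat
comments and strings: on seeing the opening delimiter it switches to a
second scanning function and returns the whole block as a single token.
The token type, the delimiters and the `tokenize` function are assumptions
made up for this sketch, and it only applies when the opening delimiter
alone is enough to recognise the block.

```ocaml
(* Hypothetical token type: the whole block becomes one VERBATIM token. *)
type token = TEXT of string | VERBATIM of string | EOF

let open_tag = "\\begin{verbatim}"
let close_tag = "\\end{verbatim}"

(* Index of the first occurrence of [sub] in [s] at or after [pos]. *)
let find_from s pos sub =
  let n = String.length s and m = String.length sub in
  let rec go i =
    if i + m > n then None
    else if String.sub s i m = sub then Some i
    else go (i + 1)
  in
  go pos

let tokenize (s : string) : token list =
  let n = String.length s in
  (* Ordinary rule: plain text up to the next opening delimiter. *)
  let rec general pos acc =
    if pos >= n then List.rev (EOF :: acc)
    else match find_from s pos open_tag with
      | Some i when i = pos -> verbatim (pos + String.length open_tag) acc
      | Some i -> general i (TEXT (String.sub s pos (i - pos)) :: acc)
      | None -> List.rev (EOF :: TEXT (String.sub s pos (n - pos)) :: acc)
  (* Second rule, entered by the lexer itself: consume everything up to the
     closing delimiter, exactly as a comment or string rule would. *)
  and verbatim start acc =
    match find_from s start close_tag with
    | Some i ->
      let raw = String.sub s start (i - start) in
      general (i + String.length close_tag) (VERBATIM raw :: acc)
    | None -> failwith "unterminated verbatim block"
  in
  general 0 []

(* tokenize "a\\begin{verbatim}x\\end{verbatim}b" yields
   [TEXT "a"; VERBATIM "x"; TEXT "b"; EOF] *)
```

The parser then never needs to influence the lexer, at the cost of the
grammar no longer seeing the begin and end delimiters as separate tokens.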

MfG
        Goswin



* Re: The lexer hack
@ 2009-11-10 15:26 Jeff Shaw
  0 siblings, 0 replies; 7+ messages in thread
From: Jeff Shaw @ 2009-11-10 15:26 UTC
  To: caml-list

Dario,
You could write your lexers in Menhir and make them part of your 
grammar. I know this isn't a terribly easy solution but it would be 
elegant IMO.
