caml-list - the Caml user's mailing list
From: Martin Jambon <martin.jambon@ens-lyon.org>
To: Jake Donham <jake.donham@skydeck.com>
Cc: caml-list@yquem.inria.fr
Subject: Re: [Caml-list] ocamllex regexp problem
Date: Wed, 19 Mar 2008 16:21:07 +0100 (CET)
Message-ID: <Pine.LNX.4.64.0803191620540.9504@martin.ec.wink.com>
In-Reply-To: <c7e4e9f0803181903p49ab284h8ad74e8021be2ccd@mail.gmail.com>

On Tue, 18 Mar 2008, Jake Donham wrote:

> Hi list,
>
> I am trying to parse an RSS feed using OCaml-RSS, which uses XML-Light,
> which however does not support CDATA blocks. So I added support in the
> ocamllex-based lexer as follows:
>
>  let ends_sq = [^']']* ']'
>  let ends_sq_sq = ends_sq ([^']'] ends_sq)* ']'+
>  let ends_sq_sq_ang = ends_sq_sq ([^'>'] ends_sq_sq)* '>'
>
> or expanded:
>
>  let ends_sq_sq_ang = (([^']']*']') ([^']'] ([^']']*']'))* ']'+) ([^'>']
> (([^']']*']') ([^']'] ([^']']*']'))* ']'+))* '>'
>
>  rule token = parse
>  [...]
>          | "<![CDATA[" (ends_sq_sq_ang as data)
>  [...]
>
> Here ends_sq_sq_ang is supposed to match strings ending in ]]> which may
> contain ] and >. If I give it an input like "foo]]]>bar]]>" (note the extra
> square bracket after foo), ocamllex matches the whole input instead of just
> "foo]]]>" as I would expect. But Micmatch, when given the same regexp, does
> the right thing. (The ']'+ bits are supposed to handle the "]]]>" case.)
>
> I have probably done something stupid and am embarrassing myself by
> advertising it to the list, but I did check it carefully. Any idea why this
> doesn't work? Thanks,

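It's interesting. Note that both solutions are correct: the whole input
and "foo]]]>" both match your regexp. With "parse", ocamllex returns the
longest match. Using "shortest" instead of "parse" returns the shorter
solution for this particular example. That may solve your problem.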
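
A minimal sketch of what I mean, assuming you move the CDATA case into
its own entry point (the name "cdata" is made up here) so that the
shortest-match rule does not change how the rest of "token" behaves:

```ocaml
(* cdata.mll: process with ocamllex. The entry point name "cdata" is
   hypothetical; the three regexps are the ones from your message. *)
let ends_sq = [^']']* ']'
let ends_sq_sq = ends_sq ([^']'] ends_sq)* ']'+
let ends_sq_sq_ang = ends_sq_sq ([^'>'] ends_sq_sq)* '>'

(* "shortest" makes this entry point stop at the shortest prefix that
   matches, whereas "parse" would keep going to the longest one. *)
rule cdata = shortest
  | "<![CDATA[" (ends_sq_sq_ang as data) { data }
```

With this rule, "data" still includes the trailing "]]>", so you would
have to strip the last three characters yourself.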

In general, I find it hard to predict which solution will pop up first
when some complex backtracking is involved, regardless of any
theoretical reasons.
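
To see why there are two valid answers, here is a small hand-written
illustration in plain OCaml. It is not ocamllex and not your regexp; it
only assumes the intent you stated (a match is any prefix ending in
"]]>") and lists every position where such a prefix can end. The
function names are made up for the example.

```ocaml
(* The terminator that closes a CDATA block. *)
let terminator = "]]>"

(* All prefix lengths n such that the first n characters of s end
   with the terminator, in increasing order. *)
let match_ends (s : string) : int list =
  let t = String.length terminator in
  let rec go i acc =
    if i + t > String.length s then List.rev acc
    else if String.sub s i t = terminator then go (i + 1) ((i + t) :: acc)
    else go (i + 1) acc
  in
  go 0 []

(* The shortest candidate, as a shortest-match lexer would pick. *)
let shortest_match (s : string) : string option =
  match match_ends s with
  | [] -> None
  | n :: _ -> Some (String.sub s 0 n)

(* The longest candidate, as a longest-match lexer would pick. *)
let longest_match (s : string) : string option =
  match List.rev (match_ends s) with
  | [] -> None
  | n :: _ -> Some (String.sub s 0 n)

let () =
  let input = "foo]]]>bar]]>" in
  (* Two candidates: one ends after "foo]]]>", one at the end of input. *)
  assert (match_ends input = [7; 13]);
  assert (shortest_match input = Some "foo]]]>");
  assert (longest_match input = Some input)
```

Both candidates are legitimate matches; which one you get depends only
on the rule the matcher uses to choose among them.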

My advice would be to use PCRE (from micmatch) for line-oriented parsing,
where you can take advantage of lazy quantifiers and assertions, and
ocamllex when end-of-lines are insignificant and things are nicely nested.
If it's not so simple, try making several passes, possibly starting by
discovering blocks based on indentation and then parsing each block
with another technique.

When, in addition, you have to extract as much as possible from your data
even if some syntax errors are present, it gets hard. When you must
tolerate these errors in exactly the same way as an existing dominant
implementation (such as MediaWiki), it tends to become impossible.



Martin

--
http://wink.com/profile/mjambon
http://martin.jambon.free.fr

