caml-list - the Caml user's mailing list
* ocamllex regexp problem
@ 2008-03-19  2:03 Jake Donham
  2008-03-19  9:00 ` [Caml-list] " Michael Wohlwend
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: Jake Donham @ 2008-03-19  2:03 UTC
  To: caml-list

Hi list,

I am trying to parse an RSS feed using OCaml-RSS, which uses XML-Light,
which however does not support CDATA blocks. So I added support in the
ocamllex-based lexer as follows:

  let ends_sq = [^']']* ']'
  let ends_sq_sq = ends_sq ([^']'] ends_sq)* ']'+
  let ends_sq_sq_ang = ends_sq_sq ([^'>'] ends_sq_sq)* '>'

or expanded:

  let ends_sq_sq_ang = (([^']']*']') ([^']'] ([^']']*']'))* ']'+) ([^'>']
(([^']']*']') ([^']'] ([^']']*']'))* ']'+))* '>'

  rule token = parse
  [...]
          | "<![CDATA[" (ends_sq_sq_ang as data)
  [...]

Here ends_sq_sq_ang is supposed to match strings ending in ]]> which may
contain ] and >. If I give it an input like "foo]]]>bar]]>" (note the extra
square bracket after foo), ocamllex matches the whole input instead of just
"foo]]]>" as I would expect. But Micmatch, when given the same regexp, does
the right thing. (The ']'+ bits are supposed to handle the "]]]>" case.)

I have probably done something stupid and am embarrassing myself by
advertising it to the list, but I did check it carefully. Any idea why this
doesn't work? Thanks,

Jake


* Re: [Caml-list] ocamllex regexp problem
From: Michael Wohlwend @ 2008-03-19  9:00 UTC
  To: caml-list

On Wednesday, 19 March 2008 03:03:25, Jake Donham wrote:
> Hi list,
>   rule token = parse
>   [...]


I think the longest-match rule eats the whole input.
Maybe you want:

   rule token = parse
   [...]

           | "<![CDATA["  { cdata lexbuf }

   [...]

  and cdata = shortest
   | (_* as d)  "]]>" {  Printf.printf "found data:'%s'\n" d; }
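
For comparison, the effect of the "shortest" rule above can be sketched in
plain OCaml, without ocamllex. This is only an illustration: the name
cdata_body is hypothetical and is not part of XML-Light or OCaml-RSS. It
scans for the first "]]>" and returns the text before it together with the
offset just past the terminator.

```ocaml
(* Hypothetical helper, for illustration only.
   Returns the text between [start] and the first "]]>" found at or after
   [start], paired with the offset just past that terminator.
   Returns [None] if no terminator is found. *)
let cdata_body s start =
  let n = String.length s in
  let rec scan i =
    if i + 3 > n then None
    else if s.[i] = ']' && s.[i + 1] = ']' && s.[i + 2] = '>' then
      Some (String.sub s start (i - start), i + 3)
    else scan (i + 1)
  in
  scan start
```

On the input from the original message, cdata_body "foo]]]>bar]]>" 0
yields Some ("foo]", 7): the extra "]" belongs to the data and the scan
stops at the first terminator.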



 Michael



* Re: [Caml-list] ocamllex regexp problem
From: Martin Jambon @ 2008-03-19 15:21 UTC
  To: Jake Donham; +Cc: caml-list

On Tue, 18 Mar 2008, Jake Donham wrote:

> Hi list,
>
> I am trying to parse an RSS feed using OCaml-RSS, which uses XML-Light,
> which however does not support CDATA blocks. So I added support in the
> ocamllex-based lexer as follows:
>
>  let ends_sq = [^']']* ']'
>  let ends_sq_sq = ends_sq ([^']'] ends_sq)* ']'+
>  let ends_sq_sq_ang = ends_sq_sq ([^'>'] ends_sq_sq)* '>'
>
> or expanded:
>
>  let ends_sq_sq_ang = (([^']']*']') ([^']'] ([^']']*']'))* ']'+) ([^'>']
> (([^']']*']') ([^']'] ([^']']*']'))* ']'+))* '>'
>
>  rule token = parse
>  [...]
>          | "<![CDATA[" (ends_sq_sq_ang as data)
>  [...]
>
> Here ends_sq_sq_ang is supposed to match strings ending in ]]> which may
> contain ] and >. If I give it an input like "foo]]]>bar]]>" (note the extra
> square bracket after foo), ocamllex matches the whole input instead of just
> "foo]]]>" as I would expect. But Micmatch, when given the same regexp, does
> the right thing. (The ']'+ bits are supposed to handle the "]]]>" case.)
>
> I have probably done something stupid and am embarrassing myself by
> advertising it to the list, but I did check it carefully. Any idea why this
> doesn't work? Thanks,

It's interesting. Note that both solutions are correct.
Using "shortest" instead of "parse" returns the shorter solution for this 
particular example. That may solve your problem.

In general, I find it hard to predict which solution should pop up earlier 
when some complex backtracking is involved, independently from any 
theoretical reasons.

My advice would be to use PCRE (from micmatch) for line-oriented parsing,
taking advantage of lazy quantifiers and assertions, and to use ocamllex
when end-of-lines are insignificant and things are nicely nested.
If it's not so simple, try to make several passes, possibly starting by 
discovering blocks based on indentation and then parse each block 
afterwards using another technique.

When in addition you have to extract the most out of your data even if 
some syntax errors are present, it gets hard. When you must tolerate these 
errors exactly in the same way as an existing dominant implementation 
(such as Mediawiki), it tends to become impossible.



Martin

--
http://wink.com/profile/mjambon
http://martin.jambon.free.fr



* Re: ocamllex regexp problem
From: Jake Donham @ 2008-03-19 16:39 UTC
  To: caml-list

On Tue, Mar 18, 2008 at 7:03 PM, Jake Donham <jake.donham@skydeck.com>
wrote:

>   let ends_sq = [^']']* ']'
>   let ends_sq_sq = ends_sq ([^']'] ends_sq)* ']'+
>   let ends_sq_sq_ang = ends_sq_sq ([^'>'] ends_sq_sq)* '>'


My colleague Haoyang Wang points out that my regexp, when viewed
nondeterministically, matches "foo]]]>bar]]>": ends_sq_sq may match only
"foo]]", and then [^'>'] matches the third "]". Changing [^'>'] to
[^'>'']'] repairs it. So I guess the answer is that Micmatch on PCRE is
greedy and stops at the first match it finds, while ocamllex returns the
longest match.
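
The two behaviours can be reproduced with a small sketch in plain OCaml.
It is a toy backtracking matcher, not how ocamllex works internally
(ocamllex compiles rules to an automaton), and not PCRE either; the type
re and the functions run, first_match and longest_match are made up for
this illustration. first_match stops at the first complete match found
with greedy quantifiers; longest_match enumerates every possible match
and keeps the longest end position.

```ocaml
(* A tiny regex AST, just large enough for the patterns in this thread. *)
type re =
  | Chr of char        (* 'c'     *)
  | Not of char list   (* [^ ...] *)
  | Seq of re * re     (* r1 r2   *)
  | Star of re         (* r*      *)

let plus r = Seq (r, Star r)   (* r+ *)

(* Backtracking matcher in continuation-passing style. [k] receives each
   candidate end position; returning [None] asks for the next candidate. *)
let rec run re s i k =
  match re with
  | Chr c -> if i < String.length s && s.[i] = c then k (i + 1) else None
  | Not cs ->
    if i < String.length s && not (List.mem s.[i] cs) then k (i + 1)
    else None
  | Seq (a, b) -> run a s i (fun j -> run b s j k)
  | Star r ->
    (* Greedy: try one more iteration first, then fall back to stopping. *)
    (match run r s i (fun j -> if j > i then run re s j k else None) with
     | Some _ as found -> found
     | None -> k i)

(* End offset of the first match found, anchored at offset 0. *)
let first_match re s = run re s 0 (fun j -> Some j)

(* End offset of the longest match, found by rejecting every candidate
   so that the matcher enumerates all of them. *)
let longest_match re s =
  let best = ref None in
  let record j =
    (match !best with
     | Some b when b >= j -> ()
     | _ -> best := Some j);
    None
  in
  ignore (run re s 0 record);
  !best

(* The definitions from the original message; [excluded] is the character
   set of the negated class in ends_sq_sq_ang. *)
let ends_sq = Seq (Star (Not [']']), Chr ']')
let ends_sq_sq =
  Seq (ends_sq, Seq (Star (Seq (Not [']'], ends_sq)), plus (Chr ']')))
let ends_sq_sq_ang excluded =
  Seq (ends_sq_sq, Seq (Star (Seq (Not excluded, ends_sq_sq)), Chr '>'))

let original = ends_sq_sq_ang ['>']        (* [^'>']     *)
let repaired = ends_sq_sq_ang ['>'; ']']   (* [^'>'']']  *)
let input = "foo]]]>bar]]>"
```

With these definitions, first_match original input gives Some 7
("foo]]]>"), while longest_match original input gives Some 13 (the whole
input). With repaired, both functions give Some 7.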

Thanks to those who replied,

Jake
