From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on yquem.inria.fr X-Spam-Level: X-Spam-Status: No, score=0.0 required=5.0 tests=none autolearn=disabled version=3.1.3 X-Original-To: caml-list@yquem.inria.fr Delivered-To: caml-list@yquem.inria.fr Received: from mail2-relais-roc.national.inria.fr (mail2-relais-roc.national.inria.fr [192.134.164.83]) by yquem.inria.fr (Postfix) with ESMTP id 4A5CFBBCA for ; Wed, 19 Mar 2008 16:21:54 +0100 (CET) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: Ag8BALTM4EdCbwQZhmdsb2JhbACRAwEBAQoDCQkWmQo X-IronPort-AV: E=Sophos;i="4.25,524,1199660400"; d="scan'208";a="8573310" Received: from out1.smtp.messagingengine.com ([66.111.4.25]) by mail2-smtp-roc.national.inria.fr with ESMTP; 19 Mar 2008 16:21:53 +0100 Received: from compute2.internal (compute2.internal [10.202.2.42]) by out1.messagingengine.com (Postfix) with ESMTP id 21804D89FE; Wed, 19 Mar 2008 11:21:52 -0400 (EDT) Received: from heartbeat2.messagingengine.com ([10.202.2.161]) by compute2.internal (MEProxy); Wed, 19 Mar 2008 11:21:52 -0400 X-Sasl-enc: zCEXGtL1adfmKCKACJrnD1ZlwmjTWwiu5cdvPT2Z1c1f 1205940111 Received: from [192.168.1.10] (ALyon-157-1-20-67.w81-251.abo.wanadoo.fr [81.251.179.67]) by mail.messagingengine.com (Postfix) with ESMTPSA id 527832DB57; Wed, 19 Mar 2008 11:21:51 -0400 (EDT) Date: Wed, 19 Mar 2008 16:21:07 +0100 (CET) From: Martin Jambon X-X-Sender: martin@martin.ec.wink.com To: Jake Donham Cc: caml-list@yquem.inria.fr Subject: Re: [Caml-list] ocamllex regexp problem In-Reply-To: Message-ID: References: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Spam: no; 0.00; ens-lyon:01 ocamllex:01 regexp:01 lexer:01 foo:01 foo:01 ocamllex:01 regexp:01 backtracking:01 pcre:01 quantifiers:01 syntax:01 tolerate:01 token:01 assertions:01 On Tue, 18 Mar 2008, Jake Donham wrote: > Hi list, > > I am trying to parse an RSS feed using OCaml-RSS, which uses XML-Light, > which however does not support CDATA blocks. So I added support in the > ocamllex-based lexer as follows: > > let ends_sq = [^']']* ']' > let ends_sq_sq = ends_sq ([^']'] ends_sq)* ']'+ > let ends_sq_sq_ang = ends_sq_sq ([^'>'] ends_sq_sq)* '>' > > or expanded: > > let ends_sq_sq_ang = (([^']']*']') ([^']'] ([^']']*']'))* ']'+) ([^'>'] > (([^']']*']') ([^']'] ([^']']*']'))* ']'+))* '>' > > rule token = parse > [...] > | " [...] > > Here ends_sq_sq_ang is supposed to match strings ending in ]]> which may > contain ] and >. If I give it an input like "foo]]]>bar]]>" (note the extra > square bracket after foo), ocamllex matches the whole input instead of just > "foo]]]>" as I would expect. But Micmatch, when given the same regexp, does > the right thing. (The ']'+ bits are supposed to handle the "]]]>" case.) > > I have probably done something stupid and am embarrassing myself by > advertising it to the list, but I did check it carefully. Any idea why this > doesn't work? Thanks, It's interesting. Note that both solutions are correct. Using "shortest" instead of "parse" returns the shorter solution for this particular example. That may solve your problem. In general, I find it hard to predict which solution should pop up earlier when some complex backtracking is involved, independently from any theoretical reasons. My advice would be to use PCRE (from micmatch) for line-oriented parsing and take advantage of lazy quantifiers and assertions or ocamllex when end-of-lines are insignificant and things are nicely nested. If it's not so simple, try to make several passes, possibly starting by discovering blocks based on indentation and then parse each block afterwards using another technique. When in addition you have to extract the most out of your data even if some syntax errors are present, it gets hard. When you must tolerate these errors exactly in the same way as an existing dominant implementation (such as Mediawiki), it tends to become impossible. Martin -- http://wink.com/profile/mjambon http://martin.jambon.free.fr