From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on yquem.inria.fr X-Spam-Level: X-Spam-Status: No, score=0.6 required=5.0 tests=AWL,HTML_MESSAGE autolearn=disabled version=3.1.3 X-Original-To: caml-list@yquem.inria.fr Delivered-To: caml-list@yquem.inria.fr Received: from mail4-relais-sop.national.inria.fr (mail4-relais-sop.national.inria.fr [192.134.164.105]) by yquem.inria.fr (Postfix) with ESMTP id 7D89ABBCA for ; Wed, 19 Mar 2008 03:03:42 +0100 (CET) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: ArgBADMR4EfRVaKxmGdsb2JhbACCPDaODwEBAQEBBgQECQoYkVqGOQ X-IronPort-AV: E=Sophos;i="4.25,521,1199660400"; d="scan'208";a="23932350" Received: from el-out-1112.google.com ([209.85.162.177]) by mail4-smtp-sop.national.inria.fr with ESMTP; 19 Mar 2008 03:03:41 +0100 Received: by el-out-1112.google.com with SMTP id r23so200140elf.20 for ; Tue, 18 Mar 2008 19:03:40 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=beta; h=domainkey-signature:received:received:message-id:date:from:sender:to:subject:mime-version:content-type:x-google-sender-auth; bh=cFfR4wXJAqRE4hpCX9/LnknM4d3KSNai3YfW82j6pUE=; b=IIWp845f3lJMAbvZwVQdzaOcKRkoWTYAkIhR3RPU/PPDcVLtflfqtfoUlAymMUDvwaQOPetqTjFPhqfx3KyOtO15D6Jc8D79mZwA/dQSTVmM1tCuQ9d0dkLEcCwwtAVrIV2ehwpDZI3VNXSrMBeiPxjcBQ2n2RAnY4ry1XUD9m0= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=beta; h=message-id:date:from:sender:to:subject:mime-version:content-type:x-google-sender-auth; b=IQyx+nUHfTl7IyGB29rLGBzHKDb9O9Vk0LJ7g1R/82ZH1+JcLkp1KkNeTP6Sv5Zw4ii2/C745UoezooV5ArJ+ZZJ57PNWnhu8h14/4s4/P5IGNvq6yZUC/Wbep8No1tguNcwAWQxNOICTBfmXLKeuTHxjY2p4L0EPhCokMfCgzo= Received: by 10.65.96.6 with SMTP id y6mr5259580qbl.5.1205892205745; Tue, 18 Mar 2008 19:03:25 -0700 (PDT) Received: by 10.70.32.5 with HTTP; Tue, 18 Mar 2008 19:03:25 -0700 (PDT) Message-ID: Date: Tue, 18 Mar 2008 19:03:25 -0700 From: "Jake Donham" Sender: jake.donham@gmail.com To: caml-list@yquem.inria.fr Subject: ocamllex regexp problem MIME-Version: 1.0 Content-Type: multipart/alternative; boundary="----=_Part_15583_5066940.1205892205147" X-Google-Sender-Auth: 9047a1d57477e244 X-Spam: no; 0.00; ocamllex:01 regexp:01 lexer:01 foo:01 foo:01 ocamllex:01 regexp:01 lexer:01 token:01 token:01 strings:01 strings:01 data:02 data:02 match:02 ------=_Part_15583_5066940.1205892205147 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Content-Disposition: inline Hi list, I am trying to parse an RSS feed using OCaml-RSS, which uses XML-Light, which however does not support CDATA blocks. So I added support in the ocamllex-based lexer as follows: let ends_sq = [^']']* ']' let ends_sq_sq = ends_sq ([^']'] ends_sq)* ']'+ let ends_sq_sq_ang = ends_sq_sq ([^'>'] ends_sq_sq)* '>' or expanded: let ends_sq_sq_ang = (([^']']*']') ([^']'] ([^']']*']'))* ']'+) ([^'>'] (([^']']*']') ([^']'] ([^']']*']'))* ']'+))* '>' rule token = parse [...] | " which may contain ] and >. If I give it an input like "foo]]]>bar]]>" (note the extra square bracket after foo), ocamllex matches the whole input instead of just "foo]]]>" as I would expect. But Micmatch, when given the same regexp, does the right thing. (The ']'+ bits are supposed to handle the "]]]>" case.) I have probably done something stupid and am embarrassing myself by advertising it to the list, but I did check it carefully. Any idea why this doesn't work? Thanks, Jake ------=_Part_15583_5066940.1205892205147 Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Content-Disposition: inline Hi list,

I am trying to parse an RSS feed using OCaml-RSS, which uses XML-Light, which however does not support CDATA blocks. So I added support in the ocamllex-based lexer as follows:

  let ends_sq = [^']']* ']'
  let ends_sq_sq = ends_sq ([^']'] ends_sq)* ']'+
  let ends_sq_sq_ang = ends_sq_sq ([^'>'] ends_sq_sq)* '>'

or expanded:

  let ends_sq_sq_ang = (([^']']*']') ([^']'] ([^']']*']'))* ']'+) ([^'>'] (([^']']*']') ([^']'] ([^']']*']'))* ']'+))* '>'

  rule token = parse
  [...]
          | "<![CDATA[" (ends_sq_sq_ang as data)
  [...]

Here ends_sq_sq_ang is supposed to match strings ending in ]]> which may contain ] and >. If I give it an input like "foo]]]>bar]]>" (note the extra square bracket after foo), ocamllex matches the whole input instead of just "foo]]]>" as I would expect. But Micmatch, when given the same regexp, does the right thing. (The ']'+ bits are supposed to handle the "]]]>" case.)

I have probably done something stupid and am embarrassing myself by advertising it to the list, but I did check it carefully. Any idea why this doesn't work? Thanks,

Jake

------=_Part_15583_5066940.1205892205147--