Subject: Re: [Caml-list] XML library for validating MathML
From: Gerd Stolpmann
To: Dario Teixeira
Cc: caml-list@yquem.inria.fr
Date: Thu, 18 Sep 2008 20:28:03 +0200
Message-Id: <1221762483.17456.42.camel@flake.lan.gerd-stolpmann.de>

On Thursday, 18 Sep 2008, at 10:58 -0700, Dario Teixeira wrote:
> Hi,
>
> Well, as it turns out, building a basic "Hello World" in PXP is relatively
> simple (I followed the manual which is very helpful in the beginning).
> However, though the DTD validation works fine with the simple examples I tried,
> it fails for a MathML document.  Note that I am using the DTD as provided
> by the W3C, available from here: http://www.w3.org/Math/DTD/mathml2.tgz
>
> When processing the MathML DTD, PXP outputs a few warnings about entities
> declared twice, about names reserved for future extensions, and quite a
> lot of warnings about code points that cannot be represented.  I can ignore
> those for now.

Code points: Note that PXP defaults to ISO-8859-1 as its character set. Use
it in UTF-8 mode to get rid of these warnings.

> When it does fail, this is the error produced:
>
> In entity ent-isonum = PUBLIC "-//W3C//ENTITIES Numeric and Special Graphic for MathML 2.0//EN" "isonum.ent", at line 28, position 44:
> Called from entity [dtd] = SYSTEM "mathml2.dtd", line 1969, position 0:
> ERROR (Well-formedness constraint): The character '&' must be written as '&amp;'
>
>
> Looking at the "isonum.ent" file (packaged with the W3C zip), these are
> the contents of line 28, where the error occurs:
>
>

Well, the inner entities are expanded again when an entity is expanded.
The correct way to define &amp; is to write only the first ampersand as a
character reference, i.e. no second &. At _definition_ time this gives
"&#x26;" (the first &#x26; is expanded), and at _use_ time you finally
get &. With the wrong definition you get && at definition time, and this
is simply an illegal character sequence. By default, PXP defines &amp; as
"&#38;#38;", which is just the same in decimal notation and is also
recommended by the XML spec.

That W3C documents are erroneous is nothing new, although it is a bit
surprising that they cannot even stick to the basics of their own
formalism. I suppose they used a hacked SGML parser for developing
MathML, since SGML is more liberal about lexical details.

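The two expansion steps described above can be imitated in a few lines of plain OCaml. This is only a rough sketch: expand_char_refs is a hypothetical helper, not part of PXP, it handles only ASCII numeric character references, and the spelling "&#x26;&#x26;" for the faulty literal is an assumption based on the remark that 0x26 appears twice.

```ocaml
(* One pass of numeric character reference expansion (ASCII only).
   Hypothetical helper for illustration; this is not PXP code. *)
let expand_char_refs (s : string) : string =
  let n = String.length s in
  let buf = Buffer.create n in
  (* If a reference "&#...;" starts at index i, return the decoded
     character and the index just past the ';'. *)
  let reference_at i =
    if i + 1 >= n || s.[i] <> '&' || s.[i + 1] <> '#' then None
    else
      match String.index_from_opt s i ';' with
      | None -> None
      | Some j ->
        let body = String.sub s (i + 2) (j - i - 2) in
        let code =
          if body <> "" && body.[0] = 'x'
          then int_of_string_opt ("0" ^ body)  (* hexadecimal, e.g. x26 *)
          else int_of_string_opt body          (* decimal, e.g. 38 *)
        in
        (match code with
         | Some c when c >= 0 && c < 128 -> Some (Char.chr c, j + 1)
         | _ -> None)
  in
  let rec go i =
    if i < n then
      match reference_at i with
      | Some (c, next) -> Buffer.add_char buf c; go next
      | None -> Buffer.add_char buf s.[i]; go (i + 1)
  in
  go 0;
  Buffer.contents buf

let () =
  let good = "&#38;#38;" in     (* form recommended by the XML spec *)
  let bad = "&#x26;&#x26;" in   (* assumed spelling of the faulty form *)
  (* Definition time: one pass over the literal in the declaration. *)
  Printf.printf "good, definition time: %s\n" (expand_char_refs good);
  Printf.printf "bad,  definition time: %s\n" (expand_char_refs bad);
  (* Use time: the stored replacement text is parsed once more. *)
  Printf.printf "good, use time:        %s\n"
    (expand_char_refs (expand_char_refs good))
```

With the good literal, the first pass leaves "&#38;" as the stored replacement text and the second pass yields a single "&". With the assumed faulty literal, the first pass already produces "&&", which is the sequence a parser has to reject.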
Gerd

> Though 0x26 is indeed the codepoint for the ampersand character, I don't
> get why it appears twice.  Is this a case of double escaping?  Could this
> be the reason PXP chokes?
>
> Any thoughts?
>
> Best regards,
> Dario Teixeira
>
> P.S. This is the programme I used for testing.  Its code is pretty much
>      lifted from the PXP manual:
>
>
> open Pxp_document
> open Pxp_yacc
>
> class warner =
>   object
>     method warn w = print_endline ("WARNING: " ^ w)
>   end
>
> let rec print_structure n =
>   let ntype = n#node_type
>   in match ntype with
>     | T_element name ->
>         print_endline ("Element of type " ^ name);
>         let children = n # sub_nodes
>         in List.iter print_structure children
>     | T_data ->
>         print_endline "Data"
>     | _ ->
>         assert false
>
> let () =
>   try
>     let config = {default_config with warner = new warner} in
>     let doc = parse_document_entity config (from_file "test.xml") default_spec
>     in print_structure (doc#root)
>   with
>     exc -> print_endline (Pxp_types.string_of_exn exc)

-- 
------------------------------------------------------------
Gerd Stolpmann * Viktoriastr. 45 * 64293 Darmstadt * Germany
gerd@gerd-stolpmann.de          http://www.gerd-stolpmann.de
Phone: +49-6151-153855                  Fax: +49-6151-997714
------------------------------------------------------------