Subject: Re: [Caml-list] XML library for validating MathML
From: Gerd Stolpmann
To: Till Varoquaux
Cc: Vincent Hanquez, caml-list@yquem.inria.fr
Date: Thu, 18 Sep 2008 13:52:07 +0200
Message-Id: <1221738727.17456.27.camel@flake.lan.gerd-stolpmann.de>

On Thursday, 18.09.2008, at 10:12 +0100, Till Varoquaux wrote:
> PXP is tough to work with and feels a bit crazy but it is good with
> standards (it can sort out any DTDs I have ever thrown at it).
> xml-light is, well, very broken (it doesn't even support charcode
> switching). There are several XML parsers in OCaml and I've had a
> stint with a few of them; the only two I would consider using are
> expat and PXP, with a marked preference for the latter. PXP can be very
> confusing and feels over-engineered at times but it does the job. And
> remember parsing XML is a hard job, much harder than we often give it
> credit for....
>
> Hats off to Gerd for providing us with a proper parser.

Thanks.

Initially, I thought XML was an easy format, because it looks easy. But
the specs are really challenging and full of bad compromises, and I
would expect a widely adopted standard to undergo some evaluation of
its practicability before it is published.

For instance, there are very strict rules about where whitespace has to
appear in XML, and where it must not occur. E.g. a tag is considered
illegal if the space between two of its attributes is missing. The
whitespace rules make it practically impossible to use a yacc-generated
parser (my first attempt was ocamlyacc-based, and it sort of worked
after implementing lots of parsing tricks, but it was impossible to fix
all errors, although the XML grammar is quite short after all).

There are further complications in the XML standard, and in the end it
is very difficult to implement even at the most basic level. So there
are now many parsers out there that do not implement it fully, but
rather implement a subset, because this is easier and parsing is faster.

There is much more to say about shortcomings in XML, or in the XML
standardization process. It is now an unnecessarily complicated technology.
I would advise everybody to use it only when there is no way around it,
e.g. for the exchange of structured data between organizations.

I now have a few hours of sponsorship for PXP. I'll try to improve the
documentation, because there are some parts that need more explanation
(where people feel it is over-engineered, but as Vincent pointed out,
it's the standard that demands it).

Gerd

>
> Till
>
> On Thu, Sep 18, 2008 at 9:38 AM, Vincent Hanquez wrote:
> > On Wed, Sep 17, 2008 at 11:58:05AM -0700, Dario Teixeira wrote:
> >> Given a string containing a mathematical expression in the MathML
> >> markup, I need to verify that the expression is indeed valid MathML.
> >> I am therefore looking for an XML library that can verify an expression
> >> against a given DTD.
> >>
> >> Now, I have tried Xml-light, and the code I used is listed below.
> >> Unfortunately, it fails when trying to parse MathML's DTD (it's the
> >> standard DTD from the W3C). I have tried simpler DTDs, and it does work
> >> with them; am I therefore correct in assuming that Xml-light can only
> >> handle a particular version/subset of DTD features?
> >
> > I don't know about validation (I'd probably suggest looking at PXP though),
> > but xml-light is very bad for XML compliance. The library is (happily) parsing
> > XML files that it shouldn't, which tells a lot about its validation
> > abilities ...
> >
> > For example, the XML supported character range is not even checked:
> >
> > XML 1.0 specification -- 2.2 Characters
> >
> > Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] |
> >          [#xE000-#xFFFD] | [#x10000-#x10FFFF]
> >
> > Other problems include (incomplete list):
> > - complete Unicode unawareness
> > - funny and wrong entity handling
> >
> > --
> > Vincent
> >
> > _______________________________________________
> > Caml-list mailing list.
> > Subscription management:
> > http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list
> > Archives: http://caml.inria.fr
> > Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
> > Bug reports: http://caml.inria.fr/bin/caml-bugs
> >
>
-- 
------------------------------------------------------------
Gerd Stolpmann * Viktoriastr. 45 * 64293 Darmstadt * Germany
gerd@gerd-stolpmann.de          http://www.gerd-stolpmann.de
Phone: +49-6151-153855                  Fax: +49-6151-997714
------------------------------------------------------------
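
To illustrate the character range check mentioned in the quoted text, here
is a minimal sketch in OCaml of the Char production from section 2.2 of the
XML 1.0 specification. The names is_xml_char and first_illegal_char are made
up for this example and are not part of PXP, xml-light, or any other library.
The sketch assumes the input has already been decoded into Unicode code
points, which is itself a non-trivial step in a real parser.

```ocaml
(* Sketch of the XML 1.0 "Char" production (section 2.2), as quoted in
   the thread. is_xml_char is a hypothetical helper, not a library call. *)
let is_xml_char (code : int) : bool =
  code = 0x9 || code = 0xA || code = 0xD
  || (code >= 0x20 && code <= 0xD7FF)
  || (code >= 0xE000 && code <= 0xFFFD)
  || (code >= 0x10000 && code <= 0x10FFFF)

(* Return the first code point that violates the production, if any.
   Assumes the document is given as a list of decoded code points. *)
let first_illegal_char (codes : int list) : int option =
  List.find_opt (fun c -> not (is_xml_char c)) codes
```

For example, first_illegal_char [0x3C; 0x61; 0x0B; 0x3E] returns Some 0x0B,
because the vertical tab is outside the permitted ranges. A conforming parser
has to reject such input rather than pass it through.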