caml-list - the Caml user's mailing list
* XML library for validating MathML
@ 2008-09-17 18:58 Dario Teixeira
  2008-09-17 22:13 ` [Caml-list] " Richard Jones
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Dario Teixeira @ 2008-09-17 18:58 UTC (permalink / raw)
  To: caml-list

Hi,

Given a string containing a mathematical expression in the MathML
markup, I need to verify that the expression is indeed valid MathML.
I am therefore looking for an XML library that can verify an expression
against a given DTD.

Now, I have tried Xml-light, and the code I used is listed below.
Unfortunately, it fails when trying to parse MathML's DTD (it's the
standard DTD from the W3C).  I have tried simpler DTDs, and it does work
with them; am I therefore correct in assuming that Xml-light can only
handle a particular version/subset of DTD features?

let () =
  try
    (* Parse the document and the DTD, then check the document against it. *)
    let x = Xml.parse_file "file.xml" in
    let dtd = Dtd.parse_file "mathml.dtd" in
    let checked = Dtd.check dtd in
    (* "math" is the expected root element. *)
    let proven = Dtd.prove checked "math" x in
    print_endline (Xml.to_string proven)
  with
  | Dtd.Parse_error exc -> print_endline (Dtd.parse_error exc)


There are of course other XML libraries for OCaml, and let's not forget CDuce.
Can someone recommend one solution that is guaranteed to work with the
MathML DTD?  Note that I don't need to do much with the XML tree; pretty
much all I need is a boilerplate function that returns a boolean on whether
a string is valid or not.

Thanks in advance for your input!
Best regards,
Dario Teixeira







* Re: [Caml-list] XML library for validating MathML
  2008-09-17 18:58 XML library for validating MathML Dario Teixeira
@ 2008-09-17 22:13 ` Richard Jones
  2008-09-18  2:58   ` Matt Gushee
  2008-09-18  8:38 ` Vincent Hanquez
  2008-09-18 14:26 ` Dario Teixeira
  2 siblings, 1 reply; 16+ messages in thread
From: Richard Jones @ 2008-09-17 22:13 UTC (permalink / raw)
  To: Dario Teixeira; +Cc: caml-list

On Wed, Sep 17, 2008 at 11:58:05AM -0700, Dario Teixeira wrote:
> There are of course other XML libraries for Ocaml and let's not forget Cduce.
> Can someone recommend one solution that is guaranteed to work with the
> MathML DTD?  Note that I don't need to do much with the XML tree; pretty
> much all I need is a boilerplate function that returns a boolean on whether
> a string is valid or not.

Well ... PXP is supposed to be able to validate.  No idea if it can
validate MathML though.

Rich.

-- 
Richard Jones
Red Hat



* Re: [Caml-list] XML library for validating MathML
  2008-09-17 22:13 ` [Caml-list] " Richard Jones
@ 2008-09-18  2:58   ` Matt Gushee
  2008-09-18  8:06     ` Re : " Adrien
  0 siblings, 1 reply; 16+ messages in thread
From: Matt Gushee @ 2008-09-18  2:58 UTC (permalink / raw)
  To: caml-list

Oops, shoulda sent to the list first time. Sorry, Rich.

Richard Jones wrote:
> On Wed, Sep 17, 2008 at 11:58:05AM -0700, Dario Teixeira wrote:
>> There are of course other XML libraries for Ocaml and let's not forget Cduce.
>> Can someone recommend one solution that is guaranteed to work with the
>> MathML DTD?  Note that I don't need to do much with the XML tree; pretty
>> much all I need is a boilerplate function that returns a boolean on whether
>> a string is valid or not.
> 
> Well ... PXP is supposed to be able to validate.  No idea if it can
> validate MathML though.

I've never worked with MathML, but the XML 1.0 spec says that a 
validating XML parser *must* validate against any DTD. I think PXP is 
XML 1.0-compliant, but if you're not sure, the W3C has a compliance test 
suite that you could use to verify it.

-- 
Matt Gushee
: Bantam - lightweight file manager : matt.gushee.net/software/bantam/ :
: RASCL's A Simple Configuration Language :     matt.gushee.net/rascl/ :



* Re : [Caml-list] XML library for validating MathML
  2008-09-18  2:58   ` Matt Gushee
@ 2008-09-18  8:06     ` Adrien
  0 siblings, 0 replies; 16+ messages in thread
From: Adrien @ 2008-09-18  8:06 UTC (permalink / raw)
  To: Matt Gushee; +Cc: caml-list

Hi,

I encountered a problem with xml-light's DTD support a year ago.
Basically, if you have something named "ABCD" and something else named
"ABCDEF", it will fail. That could be your problem.
Unfortunately, I never had time to fix it.

Well, no time to be more precise, I'm already late.


 ---

Adrien Nader


* Re: [Caml-list] XML library for validating MathML
  2008-09-17 18:58 XML library for validating MathML Dario Teixeira
  2008-09-17 22:13 ` [Caml-list] " Richard Jones
@ 2008-09-18  8:38 ` Vincent Hanquez
  2008-09-18  9:12   ` Till Varoquaux
  2008-09-18 14:26 ` Dario Teixeira
  2 siblings, 1 reply; 16+ messages in thread
From: Vincent Hanquez @ 2008-09-18  8:38 UTC (permalink / raw)
  To: Dario Teixeira; +Cc: caml-list

On Wed, Sep 17, 2008 at 11:58:05AM -0700, Dario Teixeira wrote:
> Given a string containing a mathematical expression in the MathML
> markup, I need to verify that the expression is indeed valid MathML.
> I am therefore looking for an XML library that can verify an expression
> against a given DTD.
> 
> Now, I have tried Xml-light, and the code I used is listed below.
> Unfortunately, it fails when trying to parse MathML's DTD (it's the
> standard DTD from the W3C).  I have tried simpler DTDs, and it does work
> with them; am I therefore correct in assuming that Xml-light can only
> handle a particular version/subset of DTD features?

I don't know about validation (I'd probably suggest looking at PXP, though),
but xml-light is very bad at XML compliance. The library happily parses
XML files that it shouldn't, which says a lot about its validation
abilities ...

For example, the character range allowed by XML is not even checked:

XML 1.0 specification -- 2.2 Characters

Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] |
         [#xE000-#xFFFD] | [#x10000-#x10FFFF]

Other problems include (incomplete list):
- complete lack of Unicode awareness
- odd and incorrect entity handling
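
To make the character check concrete, here is a minimal sketch of the Char
production quoted above, assuming the input has already been decoded into
integer code points. The names is_xml_char and latin1_is_xml_text are
hypothetical and not part of xml-light or any other library.

```ocaml
(* Hypothetical helper: does the code point [cp] match the Char production
   of XML 1.0, section 2.2? *)
let is_xml_char cp =
  cp = 0x9 || cp = 0xA || cp = 0xD
  || (cp >= 0x20 && cp <= 0xD7FF)
  || (cp >= 0xE000 && cp <= 0xFFFD)
  || (cp >= 0x10000 && cp <= 0x10FFFF)

(* In a single-byte encoding such as ISO-8859-1 every byte is one code
   point, so a whole string can be checked byte by byte. *)
let latin1_is_xml_text s =
  let ok = ref true in
  String.iter (fun c -> if not (is_xml_char (Char.code c)) then ok := false) s;
  !ok
```

For multi-byte encodings such as UTF-8 the string would have to be decoded
into code points first; that step is deliberately left out of this sketch.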

-- 
Vincent



* Re: [Caml-list] XML library for validating MathML
  2008-09-18  8:38 ` Vincent Hanquez
@ 2008-09-18  9:12   ` Till Varoquaux
  2008-09-18  9:44     ` Vincent Hanquez
  2008-09-18 11:52     ` Gerd Stolpmann
  0 siblings, 2 replies; 16+ messages in thread
From: Till Varoquaux @ 2008-09-18  9:12 UTC (permalink / raw)
  To: Vincent Hanquez; +Cc: Dario Teixeira, caml-list

PXP is tough to work with and feels a bit crazy, but it is good with
standards (it can sort out any DTDs I have ever thrown at it).
xml-light is, well, very broken (it doesn't even support charcode
switching). There are several XML parsers in OCaml and I've had a
stint with a few of them; the only two I would consider using are
expat and PXP, with a marked preference for the latter. PXP can be very
confusing and feels over-engineered at times, but it does the job. And
remember, parsing XML is a hard job, much harder than we often give it
credit for....

Hats off to Gerd for providing us with a proper parser.

Till


* Re: [Caml-list] XML library for validating MathML
  2008-09-18  9:12   ` Till Varoquaux
@ 2008-09-18  9:44     ` Vincent Hanquez
  2008-09-18 11:52     ` Gerd Stolpmann
  1 sibling, 0 replies; 16+ messages in thread
From: Vincent Hanquez @ 2008-09-18  9:44 UTC (permalink / raw)
  To: Till Varoquaux; +Cc: Dario Teixeira, caml-list

On Thu, Sep 18, 2008 at 10:12:26AM +0100, Till Varoquaux wrote:
> PXP is tough to work with and feels a bit crazy but it is good with
> standards (It can sort out any DTD's I have ever thrown at it).
> xml-light is, well, very broken (it doesn't even support charcode
> switching). There are several XML parsers in OCaml and I've had a
> stint with a few of them; the only two I would consider using are
> expat and PXP with a marked preference for the latter. PXP can be very
> confusing and feels over engineered at times but it does the job.

It's over-engineered ... just like the XML spec :)

I don't do DTD validation, and I have had great success with xmlm, which is
_much_ better in terms of XML compliance.

> And
> remember parsing XML is a hard job, much harder than we often give it
> credit for....

I certainly agree. It's hard, and also slow to parse.
I tend to prefer alternative formats nowadays.

-- 
Vincent



* Re: [Caml-list] XML library for validating MathML
  2008-09-18  9:12   ` Till Varoquaux
  2008-09-18  9:44     ` Vincent Hanquez
@ 2008-09-18 11:52     ` Gerd Stolpmann
  2008-09-18 13:35       ` Markus Mottl
  2008-09-19 11:30       ` Matt Gushee
  1 sibling, 2 replies; 16+ messages in thread
From: Gerd Stolpmann @ 2008-09-18 11:52 UTC (permalink / raw)
  To: Till Varoquaux; +Cc: Vincent Hanquez, caml-list


On Thursday, 18.09.2008 at 10:12 +0100, Till Varoquaux wrote:
> PXP is tough to work with and feels a bit crazy but it is good with
> standards (It can sort out any DTD's I have ever thrown at it).
> xml-light is, well, very broken (it doesn't even support charcode
> switching). There are several XML parsers in OCaml and I've had a
> stint with a few of them; the only two I would consider using are
> expat and PXP with a marked preference for the latter. PXP can be very
> confusing and feels over engineered at times but it does the job. And
> remember parsing XML is a hard job, much harder than we often give it
> credit for....
> 
> Hats off to Gerd for providing us with a proper parser.

Thanks. Initially, I thought XML was an easy format, because it looks
easy. But the specs are really challenging: they are full of bad compromises,
and I would expect a widely adopted standard to undergo some
evaluation of its practicability before it is published. For instance,
there are very strict rules about where whitespace must appear in XML, and
where it must not occur. E.g. <tag x="a"y="b"> is considered illegal
because of the missing space between the attributes. The whitespace
rules make it practically impossible to use a yacc-generated parser (my
first attempt was ocamlyacc-based, and it sort of worked after
implementing lots of parsing tricks, but it was impossible to fix all
errors, although the XML grammar is quite short after all). There are
further complications in the XML standard, and in the end it is very
difficult to implement even at the most basic level. So there are
many parsers out there that do not implement it fully, but rather implement a
subset because this is easier and parsing is faster.
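
As an illustration of the whitespace rule, here is a minimal sketch of a
checker that rejects a start tag whose attributes are not separated by
whitespace. It is a toy scanner and not how PXP does it; the function name
attributes_separated is hypothetical, and names and values are not validated
beyond what is needed to find the attribute boundaries.

```ocaml
let is_space c = c = ' ' || c = '\t' || c = '\n' || c = '\r'

(* Hypothetical checker: [true] if every attribute in the start tag [s] is
   preceded by whitespace, [false] otherwise or if the tag is malformed. *)
let attributes_separated s =
  let n = String.length s in
  (* First position >= i at which predicate [p] no longer holds. *)
  let rec skip p i = if i < n && p s.[i] then skip p (i + 1) else i in
  let is_name_char c =
    not (is_space c) && c <> '=' && c <> '>' && c <> '/' in
  let rec attrs i =
    let j = skip is_space i in
    if j >= n then false                          (* tag is never closed *)
    else if s.[j] = '>' || s.[j] = '/' then true  (* end of the start tag *)
    else if j = i then false                      (* no space before attribute *)
    else
      let k = skip is_space (skip is_name_char j) in
      if k >= n || s.[k] <> '=' then false
      else
        let q = skip is_space (k + 1) in
        if q >= n || (s.[q] <> '"' && s.[q] <> '\'') then false
        else
          (* Find the matching closing quote, then continue after it. *)
          match String.index_from_opt s (q + 1) s.[q] with
          | None -> false
          | Some close -> attrs (close + 1)
  in
  n > 0 && s.[0] = '<' && attrs (skip is_name_char 1)

let () =
  assert (attributes_separated "<tag x=\"a\" y=\"b\">");
  assert (not (attributes_separated "<tag x=\"a\"y=\"b\">"))
```

A hand-written scanner can track "was there whitespace just before this
token" directly, which is the kind of context a yacc grammar fed by a
whitespace-discarding lexer loses.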

There is much more to say about shortcomings in XML, or the XML
standardization process. It is now an unnecessary complicated
technology. I would advise everybody to use it only when there is no way
around it, e.g. for exchange of structured data between organizations.

I now have a few hours of sponsorship for PXP. I'll try to improve the
documentation, because there are some parts that need more explanation
(where people feel it is over-engineered, but as Vincent pointed out,
it's the standard that demands it).

Gerd


-- 
------------------------------------------------------------
Gerd Stolpmann * Viktoriastr. 45 * 64293 Darmstadt * Germany 
gerd@gerd-stolpmann.de          http://www.gerd-stolpmann.de
Phone: +49-6151-153855                  Fax: +49-6151-997714
------------------------------------------------------------




* Re: [Caml-list] XML library for validating MathML
  2008-09-18 11:52     ` Gerd Stolpmann
@ 2008-09-18 13:35       ` Markus Mottl
  2008-09-19 11:30       ` Matt Gushee
  1 sibling, 0 replies; 16+ messages in thread
From: Markus Mottl @ 2008-09-18 13:35 UTC (permalink / raw)
  To: Gerd Stolpmann; +Cc: Till Varoquaux, caml-list

On Thu, Sep 18, 2008 at 7:52 AM, Gerd Stolpmann <info@gerd-stolpmann.de> wrote:
> There is much more to say about shortcomings in XML, or the XML
> standardization process. It is now an unnecessary complicated
> technology.

I take it that you deliberately chose the adjective "unnecessary"
rather than the adverb "unnecessarily"... ;-)

Regards,
Markus

-- 
Markus Mottl http://www.ocaml.info markus.mottl@gmail.com



* Re: [Caml-list] XML library for validating MathML
  2008-09-17 18:58 XML library for validating MathML Dario Teixeira
  2008-09-17 22:13 ` [Caml-list] " Richard Jones
  2008-09-18  8:38 ` Vincent Hanquez
@ 2008-09-18 14:26 ` Dario Teixeira
  2008-09-18 17:58   ` Dario Teixeira
  2 siblings, 1 reply; 16+ messages in thread
From: Dario Teixeira @ 2008-09-18 14:26 UTC (permalink / raw)
  To: caml-list

Hi,

First of all, thanks to all for your input -- it was much appreciated.

I second many people's opinion that XML -- though conceptually a good and
well-intentioned idea -- has devolved into a festering plague whose tentacles
have unfortunately spread all around the IT landscape.  I'm finding that the
lispy alternative offered by Sexplib is much saner for most serialisation
and data exchange purposes, and that if you need human-readability then a
format like JSON is much friendlier.

Unfortunately, I have to deal with MathML because I am building a web
application (with Ocsigen) that allows users to enter mathematical expressions
in either TeX or MathML format.  To prevent cross-site scripting attacks,
I need to make sure the MathML provided by the user is valid and does not
contain any "funny stuff".

As for Xml-light, thanks for confirming that the problem I was having is
most likely due to bugs in the library itself.  I was hoping Xml-light would be
able to do the job because it's a library you can grok and start using right
away, whereas PXP will require a more significant investment of time.  Btw,
Gerd, the problem I have is fairly boilerplate:  "here's a DTD and here's a
string with an XML document -- is the document valid according to the DTD?"
Perhaps you could add it as an example to the documentation?

Well, off to read the PXP manual!

Thanks for your time,
Dario Teixeira







* Re: [Caml-list] XML library for validating MathML
  2008-09-18 14:26 ` Dario Teixeira
@ 2008-09-18 17:58   ` Dario Teixeira
  2008-09-18 18:28     ` Gerd Stolpmann
  0 siblings, 1 reply; 16+ messages in thread
From: Dario Teixeira @ 2008-09-18 17:58 UTC (permalink / raw)
  To: caml-list

Hi,

Well, as it turns out, building a basic "Hello World" in PXP is relatively
simple (I followed the manual which is very helpful in the beginning).
However, though the DTD validation works fine with the simple examples I tried,
it fails for a MathML document.  Note that I am using the DTD as provided
by the W3C, available from here:  http://www.w3.org/Math/DTD/mathml2.tgz

When processing the MathML DTD, PXP outputs a few warnings about entities
declared twice, about names reserved for future extensions, and quite a
lot of warnings about code points that cannot be represented.  I can ignore
those for now.

When it does fail, this is the error produced:

In entity ent-isonum = PUBLIC "-//W3C//ENTITIES Numeric and Special Graphic for MathML 2.0//EN" "isonum.ent", at line 28, position 44:
Called from entity [dtd] = SYSTEM "mathml2.dtd", line 1969, position 0:
ERROR (Well-formedness constraint): The character '&' must be written as '&amp;'


Looking at the "isonum.ent" file (packaged in the W3C archive), these are
the contents of line 28, where the error occurs:

<!ENTITY amp              "&#x26;&#x00026;" ><!--=ampersand -->


Though 0x26 is indeed the codepoint for the ampersand character, I don't
get why it appears twice.  Is this a case of double escaping?  Could this
be the reason PXP chokes?

Any thoughts?

Best regards,
Dario Teixeira

P.S.  This is the programme I used for testing.  Its code is pretty much
      lifted from the PXP manual:


open Pxp_document
open Pxp_yacc

(* Print every warning PXP emits instead of discarding it. *)
class warner =
object
  method warn w = print_endline ("WARNING: " ^ w)
end

(* Walk the tree depth-first, printing one line per element or data node. *)
let rec print_structure n =
  match n#node_type with
  | T_element name ->
    print_endline ("Element of type " ^ name);
    let children = n#sub_nodes in
    List.iter print_structure children
  | T_data ->
    print_endline "Data"
  | _ ->
    (* Any other node type is treated as unexpected here. *)
    assert false

let () =
  try
    let config = { default_config with warner = new warner } in
    (* Parse and validate the document, then dump its structure. *)
    let doc = parse_document_entity config (from_file "test.xml") default_spec in
    print_structure (doc#root)
  with exc ->
    print_endline (Pxp_types.string_of_exn exc)







* Re: [Caml-list] XML library for validating MathML
  2008-09-18 17:58   ` Dario Teixeira
@ 2008-09-18 18:28     ` Gerd Stolpmann
  2008-09-18 20:44       ` Dario Teixeira
  0 siblings, 1 reply; 16+ messages in thread
From: Gerd Stolpmann @ 2008-09-18 18:28 UTC (permalink / raw)
  To: Dario Teixeira; +Cc: caml-list


On Thursday, 18.09.2008 at 10:58 -0700, Dario Teixeira wrote:
> Hi,
> 
> Well, as it turns out, building a basic "Hello World" in PXP is relatively
> simple (I followed the manual which is very helpful in the beginning).
> However, though the DTD validation works fine with the simple examples I tried,
> it fails for a MathML document.  Note that I am using the DTD as provided
> by the W3C, available from here:  http://www.w3.org/Math/DTD/mathml2.tgz
> 
> When processing the MathML DTD, PXP outputs a few warnings about entities
> declared twice, about names reserved for future extensions, and quite a
> lot of warnings about code points that cannot be represented.  I can ignore
> those for now.

Code points: Note that PXP defaults to ISO-8859-1 as character set. Use
it in UTF-8 mode to get rid of these warnings.

> When it does fail, this is the error produced:
> 
> In entity ent-isonum = PUBLIC "-//W3C//ENTITIES Numeric and Special Graphic for MathML 2.0//EN" "isonum.ent", at line 28, position 44:
> Called from entity [dtd] = SYSTEM "mathml2.dtd", line 1969, position 0:
> ERROR (Well-formedness constraint): The character '&' must be written as '&amp;'
> 
> 
> Looking at the "isonum.ent" file (packaged with the W3C zip), these are
> the contents of line 28, where the error occurs:
> 
> <!ENTITY amp              "&#x26;&#x00026;" ><!--=ampersand -->

Well, the inner entities are again expanded when an entity is expanded.
The correct way to define &amp; is

<!ENTITY amp "&#x26;#x26;">

i.e. no second &. At _definition_ time this gives "&#x26;" (the first
&#x26; is expanded), and at _use_ time you finally get &. With the wrong
definition you get && at definition time, and this is simply an illegal
character sequence.

PXP defines by default &amp; as "&#38;#38;" which is just the same in
decimal notation, and also recommended by the XML spec.
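
The two-step expansion can be sketched as follows. This is only an
illustration under simplifying assumptions, not PXP's implementation: the
hypothetical function expand_char_refs replaces character references in one
pass and leaves everything else alone. Applying it once models definition
time; applying it a second time models use time.

```ocaml
(* Replace every character reference (&#DDD; or &#xHHH;) in [s] by the
   character it denotes, in a single left-to-right pass. Anything else,
   including a bare '&', is copied unchanged. Digits are parsed leniently
   with int_of_string_opt, so this is not a validating parser. *)
let expand_char_refs s =
  let n = String.length s in
  let buf = Buffer.create n in
  let rec go i =
    if i >= n then ()
    else if i + 1 < n && s.[i] = '&' && s.[i + 1] = '#' then begin
      match String.index_from_opt s i ';' with
      | None ->
        Buffer.add_char buf s.[i];
        go (i + 1)
      | Some j ->
        let digits = String.sub s (i + 2) (j - i - 2) in
        let literal =
          if digits <> "" && digits.[0] = 'x'
          then "0" ^ digits            (* "x26" becomes "0x26" *)
          else digits
        in
        (match int_of_string_opt literal with
         | Some cp when Uchar.is_valid cp ->
           Buffer.add_utf_8_uchar buf (Uchar.of_int cp);
           go (j + 1)
         | _ ->
           Buffer.add_char buf s.[i];
           go (i + 1))
    end else begin
      Buffer.add_char buf s.[i];
      go (i + 1)
    end
  in
  go 0;
  Buffer.contents buf

let () =
  (* Corrected declaration: "&#x26;" at definition time, "&" at use time. *)
  let defined = expand_char_refs "&#x26;#x26;" in
  assert (defined = "&#x26;");
  assert (expand_char_refs defined = "&");
  (* Declaration from isonum.ent: both references expand immediately,
     leaving the illegal sequence "&&" at definition time. *)
  assert (expand_char_refs "&#x26;&#x00026;" = "&&")
```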

That W3C docs are erroneous is nothing new, although it is a bit
surprising that they cannot even stick to the basics of their own
formalism. I suppose they used a hacked SGML parser for developing
MathML, since SGML is more liberal about lexical details.

Gerd

-- 
------------------------------------------------------------
Gerd Stolpmann * Viktoriastr. 45 * 64293 Darmstadt * Germany 
gerd@gerd-stolpmann.de          http://www.gerd-stolpmann.de
Phone: +49-6151-153855                  Fax: +49-6151-997714
------------------------------------------------------------




* Re: [Caml-list] XML library for validating MathML
  2008-09-18 18:28     ` Gerd Stolpmann
@ 2008-09-18 20:44       ` Dario Teixeira
  2008-09-18 20:48         ` Gerd Stolpmann
  2008-09-19 13:23         ` Stefano Zacchiroli
  0 siblings, 2 replies; 16+ messages in thread
From: Dario Teixeira @ 2008-09-18 20:44 UTC (permalink / raw)
  To: Gerd Stolpmann; +Cc: caml-list

Hi,

> Code points: Note that PXP defaults to ISO-8859-1 as
> character set. Use
> it in UTF-8 mode to get rid of these warnings.

Ah, thanks, I'll look into that.  By the way, is there an
API reference for PXP?  I noticed you are not using Ocamldoc,
and I'm guessing the PXP manual is built from some sort of
literate programming tool.  Unfortunately, the manual is
quite large, and it can be difficult to navigate.


> That W3C docs are erroneous is nothing new, although it is
> a bit surprising that they cannot even stick to the basics of
> their own formalism. I suppose they used a hacked SGML parser
> for developing MathML, since SGML is more liberal about
> lexical details.

I'm sure floggings will be administered.  Anyway, thanks for
diagnosing the problem!  I am happy to report
that once the isonum.ent file is fixed, PXP is able to validate
MathML documents against the official DTD.

Fixing isonum.ent was simply a matter of applying Gerd's
correction to two lines, 28 and 66.  Here is the diff:

28c28
< <!ENTITY amp              "&#x26;&#x00026;" ><!--=ampersand -->
---
> <!ENTITY amp              "&#x26;#x00026;" ><!--=ampersand -->
66c66
< <!ENTITY lt               "&#x26;&#x0003C;" ><!--=less-than sign R: -->
---
> <!ENTITY lt               "&#x26;#x0003C;" ><!--=less-than sign R: -->


Cheers,
Dario Teixeira







* Re: [Caml-list] XML library for validating MathML
  2008-09-18 20:44       ` Dario Teixeira
@ 2008-09-18 20:48         ` Gerd Stolpmann
  2008-09-19 13:23         ` Stefano Zacchiroli
  1 sibling, 0 replies; 16+ messages in thread
From: Gerd Stolpmann @ 2008-09-18 20:48 UTC (permalink / raw)
  To: Dario Teixeira; +Cc: caml-list


On Thursday, 18.09.2008 at 13:44 -0700, Dario Teixeira wrote:
> Hi,
> 
> > Code points: Note that PXP defaults to ISO-8859-1 as
> > character set. Use
> > it in UTF-8 mode to get rid of these warnings.
> 
> Ah, thanks, I'll look into that.  By the way, is there an
> API reference for PXP?  I noticed you are not using Ocamldoc,
> and I'm guessing the PXP manual is built from some sort of
> literate programming tool.  Unfortunately, the manual is
> quite large, and it can be difficult to navigate.

I know, it's from pre-ocamldoc times. I'll see what I can do.

Gerd

-- 
------------------------------------------------------------
Gerd Stolpmann * Viktoriastr. 45 * 64293 Darmstadt * Germany 
gerd@gerd-stolpmann.de          http://www.gerd-stolpmann.de
Phone: +49-6151-153855                  Fax: +49-6151-997714
------------------------------------------------------------




* Re: [Caml-list] XML library for validating MathML
  2008-09-18 11:52     ` Gerd Stolpmann
  2008-09-18 13:35       ` Markus Mottl
@ 2008-09-19 11:30       ` Matt Gushee
  1 sibling, 0 replies; 16+ messages in thread
From: Matt Gushee @ 2008-09-19 11:30 UTC (permalink / raw)
  To: caml-list

If you'll forgive the slight off-topic-ness ... I spent a couple of 
years as an XML consultant. Got out of the business partly for the kinds 
of reasons Gerd cites. Anyway, just a couple of quick comments:

Gerd Stolpmann wrote:

> Thanks. Initially, I thought XML is an easy format - because it looks
> easy. But the specs are really challenging - full of bad compromises,
> and I would expect that a widely adopted standard has to undergo some
> evaluation of its practicability before it is published.

Well, yes, one would expect that. A few factors to consider:

  1) XML is descended from SGML--which was horribly difficult to
     implement (hey, it was invented at IBM and used by the US Dept of
     Defense, so what do you expect?). Compared to full SGML, XML is very
     simple.

  2) The process of developing W3C specs (technically they're not
     "standards," though I'm not sure that really matters to anyone)
     is vendor-driven, hence highly politicized. I mean, put Microsoft,
     Sun, IBM, etc. together in the same room--need I say more? Whereas
     the IETF groups that develop RFCs, for example, are more of a
     technical meritocracy, something like the Open Source software
     community. So it should be no surprise that RFCs tend to have
     fairly good technical foundations (HTTP, anyone?) while many of
     the W3C specs are just hideous.

  3) Regarding the white-space problem, I think the design may have been
     influenced (rightly or wrongly) by notions about white space being
     helpful for human readers ... somewhat like Python with its
     indentation-based code blocks.

> There is much more to say about shortcomings in XML, or the XML
> standardization process. It is now an unnecessarily complicated
> technology. I would advise everybody to use it only when there is no way
> around it, e.g. for exchange of structured data between organizations.

And it's really not optimal for that, it's just a widely-adopted lowest 
common denominator. But bear in mind that XML was originally intended 
for structured *documents*; it got hijacked for other purposes amid all 
the e-commerce hype around the turn of the century. Some of the 
XML-based document formats (DocBook? SVG?) are, IMHO, not such bad 
compromises if you consider the known alternatives.

-- 
Matt Gushee
: Bantam - lightweight file manager : matt.gushee.net/software/bantam/ :
: RASCL's A Simple Configuration Language :     matt.gushee.net/rascl/ :



* Re: [Caml-list] XML library for validating MathML
  2008-09-18 20:44       ` Dario Teixeira
  2008-09-18 20:48         ` Gerd Stolpmann
@ 2008-09-19 13:23         ` Stefano Zacchiroli
  1 sibling, 0 replies; 16+ messages in thread
From: Stefano Zacchiroli @ 2008-09-19 13:23 UTC (permalink / raw)
  To: caml-list


On Thu, Sep 18, 2008 at 01:44:27PM -0700, Dario Teixeira wrote:
> Ah, thanks, I'll look into that.  By the way, is there an
> API reference for PXP?  I noticed you are not using Ocamldoc,

For the Debian package, I have a hack to generate an ocamldoc API reference
for PXP (and for other pre-ocamldoc projects by Gerd). Basically, I pass
the following preprocessing flag to ocamldoc:

  -pp debian/expand_stars.sh

The corresponding shell script is attached.

Yes, it is hackish, but it gives decent results. Sure, they could be improved
by reviewing the comments and making them conform to ocamldoc
conventions, but it is still something usable.

Cheers.

-- 
Stefano Zacchiroli -*- PhD in Computer Science \ PostDoc @ Univ. Paris 7
zack@{upsilon.cc,pps.jussieu.fr,debian.org} -<>- http://upsilon.cc/zack/
I'm still an SGML person,this newfangled /\ All one has to do is hit the
XML stuff is so ... simplistic  -- Manoj \/ right keys at the right time

[-- Attachment #2: expand_stars.sh --]
[-- Type: application/x-sh, Size: 82 bytes --]


end of thread, other threads:[~2008-09-19 13:24 UTC | newest]

Thread overview: 16+ messages
2008-09-17 18:58 XML library for validating MathML Dario Teixeira
2008-09-17 22:13 ` [Caml-list] " Richard Jones
2008-09-18  2:58   ` Matt Gushee
2008-09-18  8:06     ` Re : " Adrien
2008-09-18  8:38 ` Vincent Hanquez
2008-09-18  9:12   ` Till Varoquaux
2008-09-18  9:44     ` Vincent Hanquez
2008-09-18 11:52     ` Gerd Stolpmann
2008-09-18 13:35       ` Markus Mottl
2008-09-19 11:30       ` Matt Gushee
2008-09-18 14:26 ` Dario Teixeira
2008-09-18 17:58   ` Dario Teixeira
2008-09-18 18:28     ` Gerd Stolpmann
2008-09-18 20:44       ` Dario Teixeira
2008-09-18 20:48         ` Gerd Stolpmann
2008-09-19 13:23         ` Stefano Zacchiroli
