From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: weis Received: (from weis@localhost) by pauillac.inria.fr (8.7.6/8.7.3) id TAA05564 for caml-redistribution; Thu, 26 Aug 1999 19:17:04 +0200 (MET DST) Received: from nez-perce.inria.fr (nez-perce.inria.fr [192.93.2.78]) by pauillac.inria.fr (8.7.6/8.7.3) with ESMTP id UAA10538 for ; Mon, 23 Aug 1999 20:01:31 +0200 (MET DST) Received: from beach.frankfurt.netsurf.de (beach.frankfurt.netsurf.de [194.64.181.2]) by nez-perce.inria.fr (8.8.7/8.8.7) with ESMTP id UAA17651 for ; Mon, 23 Aug 1999 20:01:30 +0200 (MET DST) Received: from schneemann.darmstadt.netsurf.de (board-39.darmstadt.netsurf.de [194.163.86.167]) by beach.frankfurt.netsurf.de (8.8.5/8.8.5) with ESMTP id UAA19080 for ; Mon, 23 Aug 1999 20:01:27 +0200 (MET DST) Received: from localhost (localhost [[UNIX: localhost]]) by schneemann.darmstadt.netsurf.de (8.8.8/8.8.8) id TAA19252 for caml-list@inria.fr; Mon, 23 Aug 1999 19:58:35 +0200 From: Gerd Stolpmann Reply-To: Gerd.Stolpmann@darmstadt.netsurf.de Organization: privat To: caml-list@inria.fr Subject: New software: Markup - validating XML parser for O'Caml Date: Mon, 23 Aug 1999 19:19:35 +0200 X-Mailer: KMail [version 1.0.17] Content-Type: text/plain MIME-Version: 1.0 Message-Id: <99082319583305.00491@schneemann> Content-Transfer-Encoding: 8bit Sender: weis Hi, I've written something new that might be interestant for you: A validating XML parser. Below you find the announcement. I had some trouble with O'Caml's lexer and parser generators, and I think this is a point of discussion. First, generation of lexers and parsers implies a dependency of the lexer on the parser, i.e. the parser is the logically preceding module. This does absolutely not match the requirements of XML which includes a macro language. If you reference a macro (XML calls them "entities"), the contents of the macro must be inserted into the token stream coming from the lexer. This means that the parser modifies the behaviour of the lexer, which means that the lexer is the logically preceding module. I've tried to avoid this dependency, and the minimal solution seems to be that at least the "token" type definition should be separately generated in such cases. Illustration: Current dependency (preceding modules first; "-->" means generation): parser.mly --> parser.mli, parser.ml define the "token" type, and the parser functions lexer.mll --> lexer.ml uses the "token" type (but not the parser functions) Suggested dependency: parser.mly --> parser_token.mli, parser_token.ml define only the "token" type lexer.mll --> lexer.ml uses the "token" type parser.mly --> parser.mli parser.ml define only the parser functions Here, the code in parser.ml can now call functions of the lexer in order to modify its behaviour. ocamlyacc should have two new options: -only-token-def, -only-parser-funs. I'm currently separating the two parts by some "sed" scripts, but this makes assuptions about the implementation of ocamlyacc. The second problem was that I wanted to make the whole parser polymorphic. As an XML parser should be customizable this is an obvious requirement. My XML parser uses side effects to store definitions; this seems not to be avoidable unless you accept a totally ineffective parser. As already presented on this mailing list (August, 13th), the problem can be solved if the module generated by ocamlyacc is turned into a class with a polymorphic parameter (the other way would be to use Obj.magic). I think this could be integrated into ocamlyacc, too. And now the ANNOUNCEMENT: Markup is an experimental validating XML parser for O'Caml. It is designed such that all of the requirements that the w3c demands of an XML parser can be fulfilled, but this is not yet the case. Beginning with the positive properties of this package, already the current (initial) release should be able to parse all valid XML documents and DTDs as long as they use only ISO-8859-1 characters. This has been tested with lots of test documents, including all of James Clark's positive test records. Once the document is parsed, it can be accessed using a class interface. This interface is still under development and subject to future changes. The interface allows arbitrary access including transformations. It has a so-called "extension", i.e. every element of the document has a main object and the extension. Although there is a default, the extension is thought as the changeable part of the element class, i.e. you can provide your own extension and add further properties to the elements. Note that the class interface does not comply to the DOM standard. I think that DOM is not applicable to O'Caml. Of course, my interface has a similar task. Another design feature is that you can configure the parser such that it uses a different class for every element type. As classes are not first-order objects in O'Caml, this is done by an exemplar/instance scheme. A hashtable stores for every element type the exemplar which is simply an object of the class to be used. If the parser reads an element, it looks up the exemplar by the type of the element, and forms a clone of it. -- This way, it is possible to group the processing code for the elements as classes which seems to be very natural. This distribution contains several examples: - validate: simply parses a document and prints all error messages - readme: Defines a DTD for simple "README"-like documents, and offers conversion to HTML and text files. - xmlforms: This is already a sophisticated application that uses XML as style sheet language and data storage format. It shows how a Tk user interface can be configured by an XML style, and how data records can be stored using XML. As usual, you can download Markup from: http://people.darmstadt.netsurf.de/Gerd.Stolpmann/ocaml/ Updates will be announced in the O'Caml link database, http://www.npc.de/ocaml/linkdb/ Gerd -- ---------------------------------------------------------------------------- Gerd Stolpmann Telefon: +49 6151 997705 (privat) Viktoriastr. 100 64293 Darmstadt EMail: Gerd.Stolpmann@darmstadt.netsurf.de (privat) Germany ----------------------------------------------------------------------------