caml-list - the Caml user's mailing list
From: "Till Varoquaux" <till.varoquaux@gmail.com>
To: caml-list@yquem.inria.fr
Subject: Re: [Caml-list] Fast XML parser
Date: Thu, 19 Jul 2007 11:02:35 +0200
Message-ID: <9d3ec8300707190202t57a63aber38d86a5310cd0e9a@mail.gmail.com>
In-Reply-To: <E1IBPR6-0000rx-I4@kerneis.info>

Oops, forgot to "reply to all". Here we go again:

On 7/19/07, Gabriel Kerneis <gabriel.kerneis@enst.fr> wrote:
>
> I certainly wouldn't recommend xml-light for *every* project where an
> XML parser is needed, but look at the OP's requirements :
> > > > I am interested in parsing Wiki markup language that has a few
> > > > tags, like <pre>...</pre>, <math>...,</math>.
> > > > These tags are sparse, meaning that the ratio of number of tags /
> > > > number of bytes is low.
> On such a simple case, xml-light (which is basically a simple ocamllex
> file + a few things to build the syntax tree) should perform quite
> well. I know it doesn't handle DTD, etc. but in *that* case, who cares ?
>
Xml-light would indeed give you a very simple parser with pretty good
speed. Whether to use it rather than an event-based parser depends on
how big these files really are: xml-light builds the whole syntax tree
in memory, so unless the files are huge you shouldn't see a real
difference and might as well keep it simple.

As for compliance, xml-light sort of does DTD. The issue is a lot more
subtle: it drops many features of the XML standard (including the
encoding declaration) and thus will reject many valid XML documents.
This is, of course, not tolerable when you have to accept documents
from sources other than your own program... I wouldn't recommend
xml-light for any serious project that reads XML files from the wild.
It can however be great when you have control over the source
generating your documents (i.e. documents generated by xml-light
itself).
> > Ultimately if you are parsing very simple files and are aiming for
> > pure speed you could write a simple lexer with ocamllex and use that
> > as base layer.
>
> That could be a solution, and (provided the licence you chose for your
> project is compatible) you could even use xml-light as an example to
> begin with (stripping things you don't need).

Indeed, and that should be real quick to do since the source code is
simple and easy to read. I should have mentioned it.
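
To give an idea of what such a base layer could look like, here is a
minimal sketch in plain OCaml rather than ocamllex. It is not taken
from xml-light: the token type, the known_tags list and the function
names are all made up for illustration, and it assumes tags carry no
attributes. Anything that is not a known tag, including a stray '<',
is kept as text.

```ocaml
(* Hypothetical token type; not part of xml-light. *)
type token =
  | Text of string   (* raw text between tags *)
  | Open of string   (* <name> *)
  | Close of string  (* </name> *)

(* Assumption: only these attribute-free tags matter. *)
let known_tags = ["pre"; "math"]

(* Try to read a known tag at position [i], where s.[i] = '<'.
   Returns the token and the index just past the closing '>'. *)
let read_tag s i =
  match String.index_from_opt s i '>' with
  | None -> None
  | Some j ->
    let inner = String.sub s (i + 1) (j - i - 1) in
    let closing = String.length inner > 0 && inner.[0] = '/' in
    let name =
      if closing then String.sub inner 1 (String.length inner - 1)
      else inner
    in
    if List.mem name known_tags then
      Some ((if closing then Close name else Open name), j + 1)
    else None

(* Split [s] into a flat list of tokens, in document order. *)
let tokenize s =
  let n = String.length s in
  let buf = Buffer.create 64 in
  (* Push pending text (if any) onto the reversed accumulator. *)
  let flush acc =
    if Buffer.length buf = 0 then acc
    else begin
      let t = Text (Buffer.contents buf) in
      Buffer.clear buf;
      t :: acc
    end
  in
  let rec go i acc =
    if i >= n then List.rev (flush acc)
    else
      match (if s.[i] = '<' then read_tag s i else None) with
      | Some (tok, next) -> go next (tok :: flush acc)
      | None ->
        (* Ordinary character, or a '<' that starts no known tag. *)
        Buffer.add_char buf s.[i];
        go (i + 1) acc
  in
  go 0 []
```

Calling tokenize on a page gives you a flat token list that you can
fold over directly, or feed to a small tree builder if you need
nesting. It is untuned (an input with many stray '<' makes it rescan
a lot), so treat it as a starting point, not a benchmark.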

Cheers,
Till


