caml-list - the Caml user's mailing list
* Fast XML parser
@ 2007-07-18 21:58 Luca de Alfaro
  2007-07-18 22:11 ` [Caml-list] " Gabriel Kerneis
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Luca de Alfaro @ 2007-07-18 21:58 UTC
  To: caml-list


I am interested in parsing Wiki markup language that has a few tags, like
<pre>...</pre>, <math>...</math>.
These tags are sparse, meaning that the ratio of number of tags / number of
bytes is low.
I would like, given a string (or a stream) with such tags, to parse it as
fast as possible.  Efficiency is a primary consideration, and so is
simplicity of the implementation.
Do you have any advice about the library I should be using?
Thanks,

Luca



* Re: [Caml-list] Fast XML parser
  2007-07-18 21:58 Fast XML parser Luca de Alfaro
@ 2007-07-18 22:11 ` Gabriel Kerneis
  2007-07-18 22:48   ` Till Varoquaux
  2007-07-19 11:38 ` Richard Jones
  2007-07-20  7:01 ` Jon Harrop
  2 siblings, 1 reply; 7+ messages in thread
From: Gabriel Kerneis @ 2007-07-18 22:11 UTC
  To: caml-list


On Wed, 18 Jul 2007 14:58:35 -0700, "Luca de Alfaro"
<luca@dealfaro.org> wrote:
> I am interested in parsing Wiki markup language that has a few tags,
> like <pre>...</pre>, <math>...</math>.
> These tags are sparse, meaning that the ratio of number of tags /
> number of bytes is low.
> I would like, given a string (or a stream) with such tags, to parse
> it as fast as possible.  Efficiency is a primary consideration, and
> so is simplicity of the implementation.
> Do you have any advice about the library I should be using?

You want it simple, you want it light: Xml-light.

Regards,
-- 
Gabriel



* Re: [Caml-list] Fast XML parser
  2007-07-18 22:11 ` [Caml-list] " Gabriel Kerneis
@ 2007-07-18 22:48   ` Till Varoquaux
  2007-07-19  6:24     ` Gabriel Kerneis
  0 siblings, 1 reply; 7+ messages in thread
From: Till Varoquaux @ 2007-07-18 22:48 UTC
  To: Gabriel Kerneis; +Cc: caml-list

Ouch,

I beg to differ: if you want speed and can work on a stream (a linear,
top-down, left-to-right exploration of the document tree), you want an
event-based XML parser. Expat is probably one of the fastest (the C
library is known to be a speed demon). PXP does everything, including
talking Klingon and controlling the kitchen sink. It provides an
event-based layer.
I have found Xml-light to be the simplest parser. Alas, it is so
simple that it is far from implementing the full XML 1.1 specification.
This often isn't an issue, since most XML files are written in a very
small subset of the language.

Ultimately, if you are parsing very simple files and are aiming for
pure speed, you could write a simple lexer with ocamllex and use that
as the base layer.
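
A minimal sketch of what such a base layer might look like for sparse
tags. It is a hand-written scanner in plain OCaml, not an ocamllex
file, and it is not taken from any of the libraries mentioned here. The
names event, scan and events are hypothetical, and it assumes tags
carry no attributes and that a '<' without a matching '>' is plain
text. String.index_from_opt needs OCaml 4.05 or later.

```ocaml
(* Events emitted by the scanner (hypothetical names). *)
type event =
  | Text of string   (* raw text between tags *)
  | Open of string   (* <name> *)
  | Close of string  (* </name> *)

(* Scan [s] left to right and call [emit] once per event.
   Assumption: tags have no attributes; an unmatched '<' is text. *)
let scan (emit : event -> unit) (s : string) : unit =
  let n = String.length s in
  (* Emit the text in s.[from .. upto-1], if it is non-empty. *)
  let text from upto =
    if upto > from then emit (Text (String.sub s from (upto - from))) in
  let rec go pos =
    if pos < n then
      match String.index_from_opt s pos '<' with
      | None -> text pos n
      | Some lt ->
        (match String.index_from_opt s (lt + 1) '>' with
         | None -> text pos n
         | Some gt ->
           text pos lt;
           let body = String.sub s (lt + 1) (gt - lt - 1) in
           let len = String.length body in
           if len > 0 && body.[0] = '/'
           then emit (Close (String.sub body 1 (len - 1)))
           else emit (Open body);
           go (gt + 1))
  in
  go 0

(* Convenience wrapper: collect the events into a list. *)
let events (s : string) : event list =
  let acc = ref [] in
  scan (fun e -> acc := e :: !acc) s;
  List.rev !acc
```

Because the scanner jumps from one '<' to the next with
String.index_from_opt, the work per tag is small and long runs of plain
text are copied out with a single String.sub, which suits input where
tags are rare.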

On 7/19/07, Gabriel Kerneis <gabriel.kerneis@enst.fr> wrote:
> On Wed, 18 Jul 2007 14:58:35 -0700, "Luca de Alfaro"
> <luca@dealfaro.org> wrote:
> > I am interested in parsing Wiki markup language that has a few tags,
> > like <pre>...</pre>, <math>...</math>.
> > These tags are sparse, meaning that the ratio of number of tags /
> > number of bytes is low.
> > I would like, given a string (or a stream) with such tags, to parse
> > it as fast as possible.  Efficiency is a primary consideration, and
> > so is simplicity of the implementation.
> > Do you have any advice about the library I should be using?
>
> You want it simple, you want it light: Xml-light.
>
> Regards,
> --
> Gabriel


-- 
http://till-varoquaux.blogspot.com/



* Re: [Caml-list] Fast XML parser
  2007-07-18 22:48   ` Till Varoquaux
@ 2007-07-19  6:24     ` Gabriel Kerneis
  2007-07-19  9:02       ` Till Varoquaux
  0 siblings, 1 reply; 7+ messages in thread
From: Gabriel Kerneis @ 2007-07-19  6:24 UTC
  To: Till Varoquaux, caml-list


On Thu, 19 Jul 2007 00:48:07 +0200, "Till Varoquaux"
<till.varoquaux@gmail.com> wrote:
> Ouch,
> 
> I beg to differ: if you want speed and can work on a stream (a linear,
> top-down, left-to-right exploration of the document tree), you want an
> event-based XML parser. Expat is probably one of the fastest (the C
> library is known to be a speed demon). PXP does everything, including
> talking Klingon and controlling the kitchen sink. It provides an
> event-based layer.

I certainly wouldn't recommend xml-light for *every* project where an
XML parser is needed, but look at the OP's requirements:
> > > I am interested in parsing Wiki markup language that has a few
> > > tags, like <pre>...</pre>, <math>...</math>.
> > > These tags are sparse, meaning that the ratio of number of tags /
> > > number of bytes is low.
In such a simple case, xml-light (which is basically a simple ocamllex
file + a few things to build the syntax tree) should perform quite
well. I know it doesn't handle DTDs, etc., but in *that* case, who cares?
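
To make the "few things to build the syntax tree" concrete, here is a
rough sketch of turning a flat token list into a tree with an explicit
stack. This is not xml-light's code or API: the token type and the
build function are hypothetical, the constructor names Element and
PCData merely echo xml-light's, and attributes are left out entirely.

```ocaml
(* Flat tokens as a lexer might produce them (hypothetical type). *)
type token = Open of string | Close of string | Text of string

(* A simplified tree: no attributes, unlike a real XML tree. *)
type tree = Element of string * tree list | PCData of string

(* Build a forest from a token list.
   [stack] holds, for each open tag, its name and the siblings that
   were collected before it (in reverse order); [current] holds the
   children of the innermost open tag (also in reverse order).
   Raises Failure on mismatched or unclosed tags. *)
let build (tokens : token list) : tree list =
  let rec go stack current = function
    | [] ->
      if stack <> [] then failwith "unclosed tag";
      List.rev current
    | Text t :: rest -> go stack (PCData t :: current) rest
    | Open name :: rest -> go ((name, current) :: stack) [] rest
    | Close name :: rest ->
      (match stack with
       | (open_name, parent) :: stack' when open_name = name ->
         go stack' (Element (name, List.rev current) :: parent) rest
       | _ -> failwith ("unexpected </" ^ name ^ ">"))
  in
  go [] [] tokens
```

Building the tree costs one pass over the tokens plus one List.rev per
element, so for documents with few tags the lexer is likely to dominate
the running time.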

> Ultimately, if you are parsing very simple files and are aiming for
> pure speed, you could write a simple lexer with ocamllex and use that
> as the base layer.

That could be a solution, and (provided the licence you chose for your
project is compatible) you could even use xml-light as an example to
begin with (stripping things you don't need).

Kind regards,
-- 
Gabriel



* Re: [Caml-list] Fast XML parser
  2007-07-19  6:24     ` Gabriel Kerneis
@ 2007-07-19  9:02       ` Till Varoquaux
  0 siblings, 0 replies; 7+ messages in thread
From: Till Varoquaux @ 2007-07-19  9:02 UTC
  To: caml-list

Oops, forgot to "reply to all". Here we go again:

On 7/19/07, Gabriel Kerneis <gabriel.kerneis@enst.fr> wrote:
>
> I certainly wouldn't recommend xml-light for *every* project where an
> XML parser is needed, but look at the OP's requirements:
> > > > I am interested in parsing Wiki markup language that has a few
> > > > tags, like <pre>...</pre>, <math>...</math>.
> > > > These tags are sparse, meaning that the ratio of number of tags /
> > > > number of bytes is low.
> In such a simple case, xml-light (which is basically a simple ocamllex
> file + a few things to build the syntax tree) should perform quite
> well. I know it doesn't handle DTDs, etc., but in *that* case, who cares?
>
Xml-light would indeed provide a very simple parser and pretty good
speed. Whether to use it or an event-based parser depends on how
big these files really are (if they are not huge, you shouldn't see a
real difference, so you might as well keep it simple).

As for compliance, xml-light sort of does DTD. The issue is a lot more
subtle: it drops many features from the XML standard (including
encoding declarations) and thus will reject many valid XML documents.
This is, of course, not tolerable when you have to accept documents
from sources other than your own program... I wouldn't recommend
xml-light for any serious project reading XML files from the wild. It
can, however, be great when you have control over the source generating
your documents (i.e. documents generated by xml-light itself).
> > Ultimately, if you are parsing very simple files and are aiming for
> > pure speed, you could write a simple lexer with ocamllex and use that
> > as the base layer.
>
> That could be a solution, and (provided the licence you chose for your
> project is compatible) you could even use xml-light as an example to
> begin with (stripping things you don't need).

Indeed, and that should be real quick to do since the source code is
simple and easy to read. I should have mentioned it.

Cheers,
Til



* Re: [Caml-list] Fast XML parser
  2007-07-18 21:58 Fast XML parser Luca de Alfaro
  2007-07-18 22:11 ` [Caml-list] " Gabriel Kerneis
@ 2007-07-19 11:38 ` Richard Jones
  2007-07-20  7:01 ` Jon Harrop
  2 siblings, 0 replies; 7+ messages in thread
From: Richard Jones @ 2007-07-19 11:38 UTC
  To: Luca de Alfaro; +Cc: caml-list

On Wed, Jul 18, 2007 at 02:58:35PM -0700, Luca de Alfaro wrote:
> I am interested in parsing Wiki markup language that has a few tags, like
> <pre>...</pre>, <math>...</math>.
> These tags are sparse, meaning that the ratio of number of tags / number of
> bytes is low.
> I would like, given a string (or a stream) with such tags, to parse it as
> fast as possible.  Efficiency is a primary consideration, and so is
> simplicity of the implementation.
> Do you have any advice about the library I should be using?

There's some code in COCANWIKI which does exactly this:

http://sandbox.merjis.com/release

Look at the file scripts/lib/wikilib.ml.

It's not a particularly clever implementation, but it has a great deal
of testing in the real world.

As well as <xml>-like syntax it also does a lot of standard wiki
syntax like '* ' for bullet points, paragraphs, indents for
preformatted sections and so on.  And it outputs pure unadulterated
XHTML.

Rich.

-- 
Richard Jones
Red Hat



* Re: [Caml-list] Fast XML parser
  2007-07-18 21:58 Fast XML parser Luca de Alfaro
  2007-07-18 22:11 ` [Caml-list] " Gabriel Kerneis
  2007-07-19 11:38 ` Richard Jones
@ 2007-07-20  7:01 ` Jon Harrop
  2 siblings, 0 replies; 7+ messages in thread
From: Jon Harrop @ 2007-07-20  7:01 UTC
  To: caml-list

On Wednesday 18 July 2007 22:58:35 Luca de Alfaro wrote:
> I am interested in parsing Wiki markup language that has a few tags, like
> <pre>...</pre>, <math>...</math>.
> These tags are sparse, meaning that the ratio of number of tags / number of
> bytes is low.
> I would like, given a string (or a stream) with such tags, to parse it as
> fast as possible.  Efficiency is a primary consideration, and so is
> simplicity of the implementation.
> Do you have any advice about the library I should be using?
> Thanks,

I would just use XML-Light.

-- 
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
OCaml for Scientists
http://www.ffconsultancy.com/products/ocaml_for_scientists/?e

