From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on yquem.inria.fr X-Spam-Level: X-Spam-Status: No, score=0.0 required=5.0 tests=none autolearn=disabled version=3.1.3 X-Original-To: caml-list@yquem.inria.fr Delivered-To: caml-list@yquem.inria.fr Received: from discorde.inria.fr (discorde.inria.fr [192.93.2.38]) by yquem.inria.fr (Postfix) with ESMTP id F0831BC69 for ; Thu, 19 Jul 2007 08:24:29 +0200 (CEST) Received: from smtp4-g19.free.fr (smtp4-g19.free.fr [212.27.42.30]) by discorde.inria.fr (8.13.6/8.13.6) with ESMTP id l6J6OTqn008128 for ; Thu, 19 Jul 2007 08:24:29 +0200 Received: from kerneis.info (kerneis.info [82.224.215.18]) by smtp4-g19.free.fr (Postfix) with ESMTP id 34D5768F05; Thu, 19 Jul 2007 08:24:29 +0200 (CEST) Received: from localhost.lan ([127.0.0.1] helo=tatanka.kerneis.info ident=gabriel) by kerneis.info with esmtp (Exim 4.63) (envelope-from ) id 1IBPR6-0000rx-I4; Thu, 19 Jul 2007 08:24:28 +0200 Date: Thu, 19 Jul 2007 08:24:21 +0200 From: Gabriel Kerneis To: "Till Varoquaux" , caml-list@yquem.inria.fr Subject: Re: [Caml-list] Fast XML parser In-Reply-To: <9d3ec8300707181548n2c7ffa01xa9d2bea20c90c056@mail.gmail.com> References: <28fa90930707181458p26eac6e6y7b45018b7c91ca65@mail.gmail.com> <9d3ec8300707181548n2c7ffa01xa9d2bea20c90c056@mail.gmail.com> Organization: ENST X-Mailer: Claws Mail 2.10.0 (GTK+ 2.10.13; i486-pc-linux-gnu) Mime-Version: 1.0 Content-Type: multipart/signed; boundary=Sig_qrmakQdiK951v60LQ2iIv+s; protocol="application/pgp-signature"; micalg=PGP-SHA1 Message-Id: X-Miltered: at discorde with ID 469F039D.000 by Joe's j-chkmail (http://j-chkmail . ensmp . fr)! X-Spam: no; 0.00; parser:01 0200,:01 top-down:01 parser:01 pxp:01 klingon:01 wiki:01 ocamllex:01 syntax:01 lexer:01 ocamllex:01 951:98 expat:98 ...,:98 951:98 --Sig_qrmakQdiK951v60LQ2iIv+s Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Le Thu, 19 Jul 2007 00:48:07 +0200, "Till Varoquaux" a =E9crit : > Ouch, >=20 > I beg to differ, if you want speed and can work stream (linear > top-down left-right exploration of the graph), you want an event based > xml parser. expat is probably one of the fastest (the c library is > known to be a speed demon). PXP does everything including talking > klingon and controlling the kitchen sink. It provides an event based > layer. I certainly wouldn't recommend xml-light for *every* project where an XML parser is needed, but look at the OP's requirements : > > > I am interested in parsing Wiki markup language that has a few > > > tags, like
...
, ...,. > > > These tags are sparse, meaning that the ratio of number of tags / > > > number of bytes is low. On such a simple case, xml-light (which is basically a simple ocamllex file + a few things to build the syntax tree) should perform quite well. I know it doesn't handle DTD, etc. but in *that* case, who cares ? > Ultimately if you are parsing very simple files and are aiming for > pure speed you could write a simple lexer with ocamllex and use that > as base layer. That could be a solution, and (provided the licence you chose for your project is compatible) you could even use xml-light as an example to begin with (stripping things you don't need). Kind regards, --=20 Gabriel --Sig_qrmakQdiK951v60LQ2iIv+s Content-Type: application/pgp-signature; name=signature.asc Content-Disposition: attachment; filename=signature.asc -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.6 (GNU/Linux) iD8DBQFGnwOb6a2JmXQu5bYRAr2XAKCAX4U+5tZ4quT+v6tQu7/FzXAHzgCgxEtk N0sHIvkMQU8F957AOkLkeJE= =/szn -----END PGP SIGNATURE----- --Sig_qrmakQdiK951v60LQ2iIv+s--