caml-list - the Caml user's mailing list
From: oleg@okmij.org
To: no263@dpmms.cam.ac.uk
Cc: Francois.Pottier@inria.fr, caml-list@inria.fr,
	daniel.buenzli@erratique.ch
Subject: Re: [Caml-list] New release of Menhir (20141215)
Date: Mon, 22 Dec 2014 06:13:46 -0500 (EST)
Message-ID: <20141222111346.68AD4C3829@www1.g3.pair.com>
In-Reply-To: <CAPunWhD=DgnPRXJo60ppx_sGbGeVbzYSXDCqffk2dMKJ8K=Vdw@mail.gmail.com>


Regarding incremental parsing of protocols like IMAP: I have
successfully used iteratees for incremental parsing of full XML,
including CDATA, parsed entities and namespaces. (Successfully as in
deployed in production and in use around the clock since about 2010.)
Full XML is actually quite difficult to parse: for example, parsed
entity references like &amp; are not recognized within CDATA blocks,
and the content of attributes has its own whitespace handling rules.
The parser sometimes has to handle quite large XML documents; because
it is incremental, it can work in constant memory.

        http://okmij.org/ftp/Streams.html#xml
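
To make the idea concrete, here is a minimal sketch of an iteratee in
OCaml. It is not the code of the XML parser above; the names chunk,
iteratee, count_lines, feed and finish are made up for illustration.
An iteratee is either finished or waiting for the next chunk, so the
only state kept between chunks is what the continuation closes over
(here, one counter).

```ocaml
(* A piece of the input stream, or the end-of-stream marker. *)
type chunk = Chunk of string | Eof

(* An iteratee is either finished (result plus unconsumed input)
   or a continuation waiting for the next chunk. *)
type 'a iteratee =
  | Done of 'a * string
  | More of (chunk -> 'a iteratee)

(* Hypothetical example: count newlines, keeping only a counter
   between chunks, so memory use does not grow with the input. *)
let count_lines : int iteratee =
  let rec step n = function
    | Eof -> Done (n, "")
    | Chunk s ->
      let n = ref n in
      String.iter (fun c -> if c = '\n' then incr n) s;
      More (step !n)
  in
  More (step 0)

(* A trivial enumerator: push chunks into the iteratee one by one. *)
let rec feed (it : 'a iteratee) (chunks : string list) : 'a iteratee =
  match it, chunks with
  | More k, c :: rest -> feed (k (Chunk c)) rest
  | _ -> it

(* Signal end of stream and extract the result, if there is one. *)
let finish (it : 'a iteratee) : 'a option =
  match it with
  | Done (x, _) -> Some x
  | More k ->
    (match k Eof with
     | Done (x, _) -> Some x
     | More _ -> None)
```

For instance, finish (feed count_lines ["a\nb"; "\nc\n"]) evaluates to
Some 3, no matter where the chunk boundaries fall.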

I have also used iteratees to parse HTTP log files, also
incrementally. The log files have an (unintended, I hope) complication:
the user-agent string (quoted in the log) may, according to the RFC,
itself contain quotes. Since the embedded quotes are not escaped
(again, according to the RFC), we may end up with quoted strings
containing unescaped quote characters. Parsing then requires unbounded
look-ahead. Iteratees can handle that, reporting errors precisely and
recovering from them.

        http://okmij.org/ftp/Streams.html#good-error
        http://okmij.org/ftp/Streams.html#fork
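
The embedded-quote problem can be sketched as follows. This is a
simplified illustration, not the log parser mentioned above: it
assumes (my assumption, for the sake of a short example) that the
quoted field is the last one on its line, so a quote closes the field
only if a newline follows it. The decision about each quote is
deferred until the next character arrives, possibly in a later chunk.
The names last_quoted_field and feed, and the Fail constructor, are
hypothetical.

```ocaml
type chunk = Chunk of string | Eof

type 'a iteratee =
  | Done of 'a * string            (* result and unconsumed input *)
  | More of (chunk -> 'a iteratee) (* waiting for the next chunk *)
  | Fail of string                 (* error message *)

(* Read the body of a quoted field, assuming the opening quote has
   already been consumed and the field is the last one on the line. *)
let last_quoted_field () : string iteratee =
  let buf = Buffer.create 64 in
  (* [pending] is true when the previous character was a quote whose
     role (closing or embedded) is not yet known. *)
  let rec step pending = function
    | Eof ->
      if pending then Done (Buffer.contents buf, "")
      else Fail "end of input inside quoted field"
    | Chunk s -> scan pending s 0
  and scan pending s i =
    if i >= String.length s then More (step pending)
    else
      match s.[i], pending with
      | '\n', true ->
        (* The pending quote was the closing one. *)
        Done (Buffer.contents buf,
              String.sub s (i + 1) (String.length s - i - 1))
      | '\n', false -> Fail "newline inside quoted field"
      | '"', true ->
        (* The earlier quote was embedded; the new one is pending. *)
        Buffer.add_char buf '"'; scan true s (i + 1)
      | '"', false -> scan true s (i + 1)
      | c, true ->
        Buffer.add_char buf '"'; Buffer.add_char buf c;
        scan false s (i + 1)
      | c, false -> Buffer.add_char buf c; scan false s (i + 1)
  in
  More (step false)

(* Push chunks into the iteratee until it stops asking for input. *)
let rec feed it chunks =
  match it, chunks with
  | More k, c :: rest -> feed (k (Chunk c)) rest
  | _ -> it
```

Feeding the chunks "Mozilla \"Fo", "o\" 1.0\"" and "\nnext" yields
Done ("Mozilla \"Foo\" 1.0", "next"): the embedded quotes are kept and
the rest of the input is returned for the next parser.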

Incidentally, there are quite a few iteratee libraries. Some, like
pipes, emphasize apparent simplicity and do no input buffering. Their
performance is indeed pretty bad as a result.

I should also mention that a parser with a call-back interface
and no visible side effects can _automatically_ be made
incremental. The following web page describes the incrementalization
of stdlib's Genlex lexer.

        http://okmij.org/ftp/continuations/differentiating-parsers.html
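
A rough sketch of the idea, under loud assumptions: it uses OCaml 5
effect handlers as the control mechanism (the page above may use a
different one), and a toy call-back parser, sum_ints, instead of
Genlex. All names here (sum_ints, Need_char, state, incrementalize,
feed_string, finish) are hypothetical. The parser itself is left
untouched; only the call-back passed to it changes.

```ocaml
(* Requires OCaml 5.0 or later (Effect module). *)
open Effect
open Effect.Deep

(* An ordinary call-back parser: it pulls characters by calling
   [next ()] and has no other visible side effects. Toy example:
   the sum of the non-negative integers in the input. *)
let sum_ints (next : unit -> char option) : int =
  let rec go total cur =
    match next () with
    | None -> total + cur
    | Some c when c >= '0' && c <= '9' ->
      go total (cur * 10 + Char.code c - Char.code '0')
    | Some _ -> go (total + cur) 0
  in
  go 0 0

(* The effect performed whenever the parser asks for a character. *)
type _ Effect.t += Need_char : char option Effect.t

(* The incremental view of a running parser. *)
type 'a state =
  | Finished of 'a
  | Awaiting of (char option -> 'a state)

(* Run the parser with a call-back that suspends it. Each suspension
   is returned to the caller as an [Awaiting] state. Continuations
   are one-shot: each [Awaiting] function may be called only once. *)
let incrementalize (p : (unit -> char option) -> 'a) : 'a state =
  match_with
    (fun () -> p (fun () -> perform Need_char))
    ()
    { retc = (fun x -> Finished x);
      exnc = raise;
      effc = (fun (type b) (eff : b Effect.t) ->
        match eff with
        | Need_char ->
          Some (fun (k : (b, _) continuation) ->
            Awaiting (fun c -> continue k c))
        | _ -> None) }

(* Feed the characters of one chunk to a suspended parser. *)
let feed_string (st : 'a state) (s : string) : 'a state =
  let st = ref st in
  String.iter
    (fun c ->
       match !st with
       | Awaiting k -> st := k (Some c)
       | Finished _ -> ())
    s;
  !st

(* Signal end of input and extract the result, if there is one. *)
let finish = function
  | Finished x -> Some x
  | Awaiting k ->
    (match k None with
     | Finished x -> Some x
     | Awaiting _ -> None)
```

Feeding "12 3" and then "4 5" to incrementalize sum_ints and finishing
gives Some 51 (12 + 34 + 5): the number 34 is assembled across the
chunk boundary.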



