caml-list - the Caml user's mailing list
* [Caml-list] [ANN] Lambda Soup 0.6 + Markup.ml 0.7 – Improved HTML5 processing
From: Anton Bachin @ 2016-02-11 19:04 UTC (permalink / raw)
  To: caml users

Hello,

I would like to announce releases 0.6 of Lambda Soup, the CSS-selector-based
HTML scraper and rewriter, and 0.7 of Markup.ml, the streaming HTML and XML
parser.

  https://github.com/aantron/lambda-soup
  https://github.com/aantron/markup.ml


The main change in Lambda Soup is that it is now based on Markup.ml instead of
Ocamlnet. As a result,

- parsing now conforms closely to the HTML5 specification, including error
  recovery;
- HTML entity references are translated;
- encodings are detected automatically, Lambda Soup is no longer limited to
  ASCII-compatible input, and all strings emitted by the API are in UTF-8; and
- empty attributes are handled correctly.
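
As a rough illustration of the points above, here is a minimal sketch that
assumes the lambdasoup package is installed and linked. The input string is a
made-up example, not taken from either project; it contains an entity
reference, an empty attribute, and unclosed <p> tags.

```ocaml
(* Build with the lambdasoup package, e.g. via ocamlfind or dune. *)
open Soup

(* Hypothetical malformed input, chosen only for illustration. *)
let html = "<p id=greeting hidden>Fish &amp; chips<p>second"

(* Parsing recovers from the unclosed <p> tags instead of failing. *)
let soup = parse html

(* Select the first paragraph by its id with a CSS selector. *)
let greeting = soup $ "#greeting"

let () =
  (* leaf_text returns the text of the single text node, if there is one;
     the entity reference should already be translated at this point. *)
  (match leaf_text greeting with
   | Some text -> print_endline text
   | None -> print_endline "(no single text leaf)");
  (* The empty attribute is present even though it has no value. *)
  Printf.printf "hidden present: %b\n" (has_attribute "hidden" greeting);
  (* Count the paragraphs produced by error recovery. *)
  Printf.printf "paragraphs: %d\n" (count (soup $$ "p"))
```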

Lambda Soup can now accept and emit Markup.ml parsing signal streams, so it can
be used for filters, without having to parse directly from or serialize all the
way to strings. It can also be used safely with XML. Parsing is, however, much
slower than before; speeding it up depends on Markup.ml being optimized in the
future.
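
A filter built on signal streams might look roughly like the sketch below. It
assumes the stream conversions are exposed as Soup.from_signals and
Soup.signals, and that both the lambdasoup and markup packages are linked; the
input and the comment-dropping step are invented for illustration.

```ocaml
(* Build with the lambdasoup and markup packages. *)
open Soup

(* Parse with Markup.ml directly, drop comment signals from the stream, and
   build a Lambda Soup document from what remains. *)
let soup =
  Markup.string "<p>kept<!-- dropped --></p>"
  |> Markup.parse_html
  |> Markup.signals
  |> Markup.filter (function `Comment _ -> false | _ -> true)
  |> from_signals

(* Going the other way: emit signals from the tree and let Markup.ml
   serialize them, here to a string only so the result can be printed. *)
let output = signals soup |> Markup.write_html |> Markup.to_string

let () = print_endline output
```

In a longer pipeline, the final Markup.to_string could be replaced by another
stream consumer, so that no intermediate string is built.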


The HTML parser in Markup.ml, in turn, now implements the adoption agency
algorithm, an error recovery algorithm from the HTML5 specification that is
ill-suited for streaming parsers. It is also more thoroughly tested, and has
received many bugfixes.

I must thank Jerome Vouillon and Leo Wzukw for bug reports. They are greatly
appreciated.


Regards,
Anton

