caml-list - the Caml user's mailing list
* [Caml-list] [ANN] Markup.ml - HTML5 and XML parsers with error recovery
From: Anton Bachin @ 2016-01-15 16:51 UTC
  To: caml-list

Good time of day,

I would like to announce the release of Markup.ml, a pair of streaming,
error-recovering parsers for HTML and XML. Usage is simple, like this:

  (* Pretty-print HTML, with error correction. *)

  open Markup

  let () =
    channel stdin
    |> parse_html
    |> signals
    |> pretty_print
    |> write_html
    |> to_channel stdout

and

  (* Show up to 10 XML errors to the user and abort early. *)

  let report =
    let count = ref 0 in
    fun location error ->
      error |> Error.to_string ~location |> prerr_endline;
      count := !count + 1;
      if !count >= 10 then raise_notrace Exit

  let () =
    string "some xml" |> parse_xml ~report |> signals |> drain
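
The error-limit pattern in that callback can be tried without the library.
The following is only a sketch under stated assumptions: make_report and
run_with_limit are hypothetical helpers, not part of Markup.ml, and plain
strings stand in for the library's location and error values.

```ocaml
(* Hypothetical helper, not part of Markup.ml: build a report callback
   that records each message and raises Exit once [limit] errors have
   been seen. Also returns a function to read back the messages. *)
let make_report ~limit =
  let count = ref 0 in
  let seen = ref [] in
  let report location error =
    seen := Printf.sprintf "%s: %s" location error :: !seen;
    incr count;
    if !count >= limit then raise_notrace Exit
  in
  report, (fun () -> List.rev !seen)

(* Feed (location, error) pairs to the callback, as a parser would,
   and stop quietly when the callback raises Exit. *)
let run_with_limit ~limit errors =
  let report, collected = make_report ~limit in
  (try List.iter (fun (loc, err) -> report loc err) errors
   with Exit -> ());
  collected ()
```

Raising from the callback is what cuts the work short: the exception
propagates out of whatever is driving the stream, so the caller decides
where to catch it.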

While still providing an easy basic interface, the parsers are
non-blocking and can be readily used with threading libraries such as
Lwt. For example, if "s" is a char Lwt_stream.t:

  (* Assemble HTML into a tree asynchronously. *)

  type html = Text of string | Element of string * html list

  Markup_lwt.lwt_stream s
  |> parse_html
  |> signals
  |> Markup_lwt.tree
    ~text:(fun ss -> Text (String.concat "" ss))
    ~element:(fun (_, name) _ children -> Element (name, children))
  >>= (fun tree -> ...)
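
To give a feel for what assembling a tree from a signal stream involves,
here is a minimal, library-free sketch. It assumes a much simplified
signal type (the real one also carries namespaces, attributes, doctypes,
comments and so on), works on a plain list rather than a stream, and
names the text constructor Text_node to avoid clashing with the Text
signal. trees_of_signals is a hypothetical name, and this is not how
Markup.ml itself is implemented.

```ocaml
(* Simplified stand-in for a parser's signal type (assumption). *)
type signal =
  | Start_element of string
  | End_element
  | Text of string

type html = Text_node of string | Element of string * html list

(* Fold over the signals while keeping a stack of open elements. Each
   frame holds an element name and its children so far, in reverse
   order. The bottom frame, named "", collects top-level nodes. *)
let trees_of_signals signals =
  let step stack signal =
    match signal, stack with
    | Start_element name, _ -> (name, []) :: stack
    | Text s, (name, children) :: rest ->
      (name, Text_node s :: children) :: rest
    | End_element, (name, children) :: (parent, siblings) :: rest ->
      (parent, Element (name, List.rev children) :: siblings) :: rest
    | _ -> invalid_arg "trees_of_signals: unbalanced signals"
  in
  match List.fold_left step [("", [])] signals with
  | [(_, top)] -> List.rev top
  | _ -> invalid_arg "trees_of_signals: unclosed element"
```

For instance, the signals Start_element "p", Text "hi", End_element
produce the single tree Element ("p", [Text_node "hi"]).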

The parsers detect input encodings automatically. Everything is
converted to UTF-8.

Markup.ml aims at standards conformance; see the conformance status [1].
Modulo any bugs, Markup.ml should already be highly conformant. The only
significant missing pieces are the two error recovery algorithms listed
for HTML; Markup.ml already performs the rest of HTML error recovery.

The library can be found here:

  https://github.com/aantron/markup.ml

To install:

  opam install markup

Documentation is at:

  http://aantron.github.io/markup.ml

Apart from ordinary improvements to the library, there are several
possible avenues of future work:

- An HTML5/XHTML polyglot serializer.
- Parsing of XML doctype declarations for a validation library built on
  top of Markup.ml.
- An Async interface (mainly just applying a functor, but I am not
  experienced with Async at the moment).
- Factoring out the stream and I/O portions of Markup.ml into their own
  library or libraries.

Bug reports and contributions are greatly appreciated.

This work was prompted by Lambda Soup. That library could use a good,
modern HTML parser, and several people also commented on the need.

Markup.ml depends on the excellent Uutf by Daniel Buenzli. I'd also
like to thank Daniel for giving useful early feedback on the library
in the last couple of days.

Regards,
Anton


[1]: http://aantron.github.io/markup.ml/#2_Conformancestatus

