From: Daniel Bünzli
To: caml-list, caml-hump@inria.fr
Date: Sat, 5 May 2012 16:48:23 +0200
Subject: [Caml-list] [ANN] Uutf 0.9.0 and Jsonm 0.9.0

Hello,

I'd like to announce the following two modules.

First Uutf:

Uutf is a non-blocking streaming codec to decode and encode the UTF-8, UTF-16, UTF-16LE and UTF-16BE encoding schemes. It can efficiently work character by character without blocking on IO. Decoders perform character position tracking and support newline normalization. Functions are also provided to fold over the characters of UTF encoded OCaml string values and to directly encode characters in OCaml Buffer.t values.

Uutf consists of a single, independent module and is distributed under the BSD3 license.

Project home page:  http://erratique.ch/software/uutf
API doc & examples: http://erratique.ch/software/uutf/doc/Uutf

The aim of Uutf is to provide a convenient abstraction for non-blocking streaming Unicode text processing and to implement non-blocking LL(k) parsers over Unicode text. It's used by Jsonm and will certainly be used by Xmlm in the future.

The second module is Jsonm:

Jsonm is a non-blocking streaming codec to decode and encode the JSON data format. It can process JSON text without blocking on IO and without a complete in-memory representation of the data.

The alternative "uncut" codec also processes whitespace and (non-standard) JSON with JavaScript comments.

Jsonm consists of a single module and depends on Uutf. It is distributed under the BSD3 license.

Project home page:  http://erratique.ch/software/jsonm
API doc & examples: http://erratique.ch/software/jsonm/doc/Jsonm

Basically, Jsonm is to JSON what Xmlm is to XML. It takes a rather low-level approach where you work with streams of structural lexemes that reflect the data model underlying the data language. The sequence of lexemes is guaranteed to be presented to you according to a simple grammar, otherwise errors are returned. This lets you consume or produce the data without holding the whole of it in memory, while abstracting over the idiosyncrasies of the data language. I also hope it can serve as a basis for defining efficient data query combinators.
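To give a feel for the two APIs, here are two small, untested sketches; the helper names are arbitrary and the exact signatures should be checked against the API docs linked above (in the 0.9.x Uutf API characters are plain ints).

The first one folds over a UTF-8 encoded OCaml string with Uutf, counting its characters and recoding it into a Buffer.t, replacing malformed byte sequences by U+FFFD (Uutf.u_rep):

  (* Count the characters of a UTF-8 string and recode it into a
     Buffer.t, replacing malformed byte sequences by Uutf.u_rep. *)
  let recode_utf_8 s =
    let b = Buffer.create (String.length s) in
    let add count _pos = function
    | `Uchar u -> Uutf.Buffer.add_utf_8 b u; count + 1
    | `Malformed _ -> Uutf.Buffer.add_utf_8 b Uutf.u_rep; count + 1
    in
    let count = Uutf.String.fold_utf_8 add 0 s in
    count, Buffer.contents b

The second one shows what the stream of structural lexemes looks like: it decodes a JSON string with Jsonm and prints one line per lexeme until `End is decoded:

  (* Print the structural lexemes of a JSON string, one per line. *)
  let dump_lexemes s =
    let pp_lexeme ppf = function
    | `Os -> Format.fprintf ppf "Os"        (* object start *)
    | `Oe -> Format.fprintf ppf "Oe"        (* object end *)
    | `As -> Format.fprintf ppf "As"        (* array start *)
    | `Ae -> Format.fprintf ppf "Ae"        (* array end *)
    | `Name n -> Format.fprintf ppf "Name %S" n
    | `String s -> Format.fprintf ppf "String %S" s
    | `Float f -> Format.fprintf ppf "Float %g" f
    | `Bool b -> Format.fprintf ppf "Bool %b" b
    | `Null -> Format.fprintf ppf "Null"
    in
    let d = Jsonm.decoder (`String s) in
    let rec loop () = match Jsonm.decode d with
    | `Lexeme l -> Format.printf "%a@." pp_lexeme l; loop ()
    | `End -> ()
    | `Error _ ->
        let ((line, col), _) = Jsonm.decoded_range d in
        Format.printf "error at %d.%d@." line col
    | `Await -> assert false (* only for `Manual sources *)
    in
    loop ()

  let () = dump_lexemes "{ \"name\": \"Jsonm\", \"versions\": [0.9] }"

On that sample string this should print the lexemes Os, Name "name", String "Jsonm", Name "versions", As, Float 0.9, Ae, Oe, one per line, which is exactly the simple grammar mentioned above.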
Jsonm's design is however more convenient than Xmlm's: Jsonm has precise lexeme position tracking, best-effort decoding that lets you continue after an error, a trivial input termination condition (just decode `End, whereas in Xmlm you have to count), and access to whitespace so that you can write data filters that preserve as much of the original data as possible (I hope to eventually find time to fix all these defects in an incompatible release of Xmlm).

If you want to install these modules via odb, here are lines you can add to your odb package file:

http://erratique.ch/software/odb-packages.txt

Feedback is welcome,

Daniel

P.S. Since the question will likely be asked, here's how I think Jsonm compares to Yojson. Martin may want to chime in to correct me or offer a different perspective, as I'm certainly biased.

* Jsonm depends on Uutf. Yojson depends on ocamllex, cppo, easy-format and biniou.

* Jsonm inputs UTF-8, UTF-16, UTF-16LE and UTF-16BE and outputs UTF-8 encoded JSON. Yojson inputs and outputs UTF-8 encoded JSON.

* Jsonm reports character stream decoding errors and lets you bypass them by replacing the invalid bytes with the Unicode replacement character U+FFFD. Yojson, apparently by design, silently inputs invalid UTF-8 byte sequences; I consider this a security risk (or at least a wrong security default).

* Jsonm mostly sticks to the standard (with the exception of comments if you use the uncut codec). Yojson extends the standard in various ways to support the serialisation of OCaml values; it also supports the input of JavaScript comments but discards them.

* Jsonm uses only OCaml floats for JSON numbers. This limits the roundtrip of integers to those that are exactly representable in that datatype, i.e. the range [-2^53;2^53] (see the quick check at the end of this message). Yojson returns the integer string literal if the int is greater than max_int. Note however that Jsonm's behaviour is equivalent to what you get in JavaScript (and hence in all browsers), so it would in any case be ill-advised for JSON producers to go beyond this limit.

* Jsonm offers no generic tree-like JSON representation (see the examples in the doc for how to build one). Yojson offers many different generic tree-like representations.

* Jsonm has a non-blocking IO interface. To the best of my knowledge Yojson doesn't support that.

* Jsonm has a streaming IO interface. To the best of my knowledge there's an undocumented, very low-level streaming input interface in Yojson, but it bears no resemblance to Jsonm's notion of a streaming interface. There's also an undocumented streaming output interface, but it doesn't seem that you can output an object or an array without first building an in-memory JSON representation of it.

* Jsonm can perform best-effort decoding, i.e. continue to parse after an error. To the best of my knowledge Yojson cannot do that.

* Performance. I'm always reluctant to make performance claims in abstract settings; it all depends on the context. If you like unscientific benchmarks you can compare `ydump` and `jsontrip`, which both recode JSON text. Bear in mind however that the results are highly data dependent and that internally the two programs don't do the same thing: `jsontrip` does not build a generic in-memory representation of the JSON text. In my tests on random data `jsontrip` takes anywhere between 1.25 and 2.1 times the time of `ydump`. The upper bound occurs when the random numbers are only integers, which `ydump` doesn't parse as floats.
On real geojson data these ratios are between 1.38 and 1.46. But on that data, processing a 325 MB file, the resident memory used by `ydump` grows up to 1.2 GB, while the streaming interface of Jsonm, albeit slower, remains constant at only 3.8 MB. Your mileage may vary.
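A quick aside on the [-2^53;2^53] range mentioned in the comparison above: it is simply the exact-integer limit of IEEE 754 doubles (OCaml's float type), nothing Jsonm-specific. A one-liner check in plain OCaml:

  (* Every integer up to 2^53 is exactly representable in an IEEE 754
     double (an OCaml float): 2^53 - 1 is still distinguishable from
     2^53, but 2^53 + 1 is not. *)
  let () =
    let p53 = 2. ** 53. in
    Printf.printf "%b %b\n" (p53 -. 1. <> p53) (p53 +. 1. = p53)
    (* prints: true true *)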