From: Daniel Bünzli
To: caml-list, caml-hump@inria.fr
Date: Sat, 5 May 2012 16:48:23 +0200
Subject: [Caml-list] [ANN] Uutf 0.9.0 and Jsonm 0.9.0

Hello,

I'd like to announce the following two modules.

First Uutf:

Uutf is a non-blocking streaming codec to decode and encode the UTF-8, UTF-16, UTF-16LE and UTF-16BE encoding schemes. It can efficiently work character by character without blocking on IO. Decoders perform character position tracking and support newline normalization. Functions are also provided to fold over the characters of UTF encoded OCaml string values and to directly encode characters in OCaml Buffer.t values.

Uutf consists of a single, independent module and is distributed under the BSD3 license.

Project home page:  http://erratique.ch/software/uutf
API doc & examples: http://erratique.ch/software/uutf/doc/Uutf

The aim of Uutf is to provide a convenient abstraction for non-blocking streaming Unicode text processing and to implement non-blocking LL(k) parsers over Unicode text. It's used by Jsonm and will certainly be used by Xmlm in the future.

The second module is Jsonm:

Jsonm is a non-blocking streaming codec to decode and encode the JSON data format. It can process JSON text without blocking on IO and without a complete in-memory representation of the data.

The alternative "uncut" codec also processes whitespace and (non-standard) JSON with JavaScript comments.

Jsonm consists of a single module and depends on Uutf. It is distributed under the BSD3 license.

Project home page:  http://erratique.ch/software/jsonm
API doc & examples: http://erratique.ch/software/jsonm/doc/Jsonm

Basically, Jsonm is to JSON what Xmlm is to XML. It takes a rather low-level approach where you work with streams of structural lexemes that reflect the data model underlying the data language. The sequence of lexemes is guaranteed to be presented to you according to a simple grammar, otherwise errors are returned. This lets you consume or produce the data without holding the whole of it in memory, while abstracting over the idiosyncrasies of the data language. I also hope it can serve as a basis for defining efficient data query combinators.
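To give a feel for the two APIs, here are two small, untested sketches; the helper names are arbitrary and the exact signatures should be checked against the API docs linked above (in the 0.9.x Uutf API characters are plain ints).

The first one folds over a UTF-8 encoded OCaml string with Uutf, counting its characters and recoding it into a Buffer.t, replacing malformed byte sequences by U+FFFD (Uutf.u_rep):

  (* Count the characters of a UTF-8 string and recode it into a
     Buffer.t, replacing malformed byte sequences by Uutf.u_rep. *)
  let recode_utf_8 s =
    let b = Buffer.create (String.length s) in
    let add count _pos = function
    | `Uchar u -> Uutf.Buffer.add_utf_8 b u; count + 1
    | `Malformed _ -> Uutf.Buffer.add_utf_8 b Uutf.u_rep; count + 1
    in
    let count = Uutf.String.fold_utf_8 add 0 s in
    count, Buffer.contents b

The second one shows what the stream of structural lexemes looks like: it decodes a JSON string with Jsonm and prints one line per lexeme until `End is decoded:

  (* Print the structural lexemes of a JSON string, one per line. *)
  let dump_lexemes s =
    let pp_lexeme ppf = function
    | `Os -> Format.fprintf ppf "Os"        (* object start *)
    | `Oe -> Format.fprintf ppf "Oe"        (* object end *)
    | `As -> Format.fprintf ppf "As"        (* array start *)
    | `Ae -> Format.fprintf ppf "Ae"        (* array end *)
    | `Name n -> Format.fprintf ppf "Name %S" n
    | `String s -> Format.fprintf ppf "String %S" s
    | `Float f -> Format.fprintf ppf "Float %g" f
    | `Bool b -> Format.fprintf ppf "Bool %b" b
    | `Null -> Format.fprintf ppf "Null"
    in
    let d = Jsonm.decoder (`String s) in
    let rec loop () = match Jsonm.decode d with
    | `Lexeme l -> Format.printf "%a@." pp_lexeme l; loop ()
    | `End -> ()
    | `Error _ ->
        let ((line, col), _) = Jsonm.decoded_range d in
        Format.printf "error at %d.%d@." line col
    | `Await -> assert false (* only for `Manual sources *)
    in
    loop ()

  let () = dump_lexemes "{ \"name\": \"Jsonm\", \"versions\": [0.9] }"

On that sample string this should print the lexemes Os, Name "name", String "Jsonm", Name "versions", As, Float 0.9, Ae, Oe, one per line, which is exactly the simple grammar mentioned above.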
Jsonm's design is however more convenient than Xmlm's: Jsonm has precise lexeme position tracking, best-effort decoding that lets you continue after an error, a trivial input termination condition (just decode `End, whereas in Xmlm you have to count), and access to whitespace so that you can write data filters that preserve as much of the original data as possible (I hope to eventually find time to fix all these defects in an incompatible release of Xmlm).

If you want to install these modules via odb, here are lines you can add to your odb package file:

http://erratique.ch/software/odb-packages.txt

Feedback is welcome,

Daniel

P.S. Since the question will likely be asked, here's how I think Jsonm compares to Yojson. Martin may want to chime in to correct me or offer a different perspective, as I'm certainly biased.

* Jsonm depends on Uutf. Yojson depends on ocamllex, cppo, easy-format and biniou.

* Jsonm inputs UTF-8, UTF-16, UTF-16LE and UTF-16BE and outputs UTF-8 encoded JSON. Yojson inputs and outputs UTF-8 encoded JSON.

* Jsonm reports character stream decoding errors and lets you bypass them by replacing the invalid bytes with the Unicode replacement character U+FFFD. Yojson, apparently by design, silently inputs invalid UTF-8 byte sequences; I consider this a security risk (or at least a wrong security default).

* Jsonm mostly sticks to the standard (with the exception of comments if you use the uncut codec). Yojson extends the standard in various ways to support the serialisation of OCaml values; it also supports the input of JavaScript comments but discards them.

* Jsonm uses only OCaml floats for JSON numbers. This limits the roundtrip of integers to those that are exactly representable in that datatype, i.e. the range [-2^53;2^53] (see the quick check at the end of this message). Yojson returns the integer string literal if the int is greater than max_int. Note however that Jsonm's behaviour is equivalent to what you get in JavaScript (and hence in all browsers), so it would in any case be ill-advised for JSON producers to go beyond this limit.

* Jsonm offers no generic tree-like JSON representation (see the examples in the doc for how to build one). Yojson offers many different generic tree-like representations.

* Jsonm has a non-blocking IO interface. To the best of my knowledge Yojson doesn't support that.

* Jsonm has a streaming IO interface. To the best of my knowledge there's an undocumented, very low-level streaming input interface in Yojson, but it bears no resemblance to Jsonm's notion of a streaming interface. There's also an undocumented streaming output interface, but it doesn't seem that you can output an object or an array without first building an in-memory JSON representation of it.

* Jsonm can perform best-effort decoding, i.e. continue to parse after an error. To the best of my knowledge Yojson cannot do that.

* Performance. I'm always reluctant to make performance claims in abstract settings; it all depends on the context. If you like unscientific benchmarks you can compare `ydump` and `jsontrip`, which both recode JSON text. Bear in mind however that the results are highly data dependent and that internally the two programs don't do the same thing: `jsontrip` does not build a generic in-memory representation of the JSON text. In my tests on random data `jsontrip` takes anywhere between 1.25 and 2.1 times the time of `ydump`. The upper bound occurs when the random numbers are only integers, which `ydump` doesn't parse as floats.
On real geojson data these ratios are between 1.38 and 1.46. But on that data, processing a 325 MB file, the resident memory used by `ydump` grows up to 1.2 GB, while the streaming interface of Jsonm, albeit slower, remains constant at only 3.8 MB. Your mileage may vary.
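A quick aside on the [-2^53;2^53] range mentioned in the comparison above: it is simply the exact-integer limit of IEEE 754 doubles (OCaml's float type), nothing Jsonm-specific. A one-liner check in plain OCaml:

  (* Every integer up to 2^53 is exactly representable in an IEEE 754
     double (an OCaml float): 2^53 - 1 is still distinguishable from
     2^53, but 2^53 + 1 is not. *)
  let () =
    let p53 = 2. ** 53. in
    Printf.printf "%b %b\n" (p53 -. 1. <> p53) (p53 +. 1. = p53)
    (* prints: true true *)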