From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <berke.durak@exalead.com>
X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on yquem.inria.fr
X-Spam-Level: 
X-Spam-Status: No, score=0.0 required=5.0 tests=AWL autolearn=disabled 
	version=3.1.3
X-Original-To: caml-list@yquem.inria.fr
Delivered-To: caml-list@yquem.inria.fr
Received: from mail3-relais-sop.national.inria.fr (mail3-relais-sop.national.inria.fr [192.134.164.104])
	by yquem.inria.fr (Postfix) with ESMTP id D0DD6BBCA
	for <caml-list@yquem.inria.fr>; Fri, 29 Feb 2008 11:19:12 +0100 (CET)
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: AgAAAG9qx0fAXQInh2dsb2JhbACQbwEBAQgKKYENmzY
X-IronPort-AV: E=Sophos;i="4.25,425,1199660400"; 
   d="scan'208";a="9741561"
Received: from concorde.inria.fr ([192.93.2.39])
  by mail3-smtp-sop.national.inria.fr with ESMTP; 29 Feb 2008 11:19:12 +0100
Received: from mail1-relais-roc.national.inria.fr (mail1-relais-roc.national.inria.fr [192.134.164.82])
	by concorde.inria.fr (8.13.6/8.13.6) with ESMTP id m1TAJAtg002442
	(version=TLSv1/SSLv3 cipher=RC4-SHA bits=128 verify=OK)
	for <caml-list@inria.fr>; Fri, 29 Feb 2008 11:19:12 +0100
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: Ao8CAB5rx0fBL1AZYmdsb2JhbACQYwIVBAYICBmBDZsp
X-IronPort-AV: E=Sophos;i="4.25,425,1199660400"; 
   d="scan'208";a="8773265"
Received: from gw.exalead.com (HELO exalead.com) ([193.47.80.25])
  by mail1-smtp-roc.national.inria.fr with ESMTP; 29 Feb 2008 11:19:11 +0100
Received: from [192.168.204.148] (madpc064.exalead.com [192.168.204.148])
	(authenticated bits=0)
	by exalead.com (8.14.0/8.14.0) with ESMTP id m1TAJ9cm019183;
	Fri, 29 Feb 2008 11:19:09 +0100
Message-ID: <47C7DC1D.9030600@exalead.com>
Date: Fri, 29 Feb 2008 11:19:09 +0100
From: Berke Durak <berke.durak@exalead.com>
User-Agent: Thunderbird 1.5.0.10 (X11/20070221)
MIME-Version: 1.0
To: Gabriel Kerneis <kerneis@enst.fr>,
	Caml-list List <caml-list@inria.fr>
Subject: Re: [Caml-list] Long-term storage of values
References: <191751.36007.qm@web54607.mail.re2.yahoo.com>	<Pine.LNX.4.64.0802281922420.16618@localhost> <20080229084000.2d3a95fe@tatanka.kerneis.info>
In-Reply-To: <20080229084000.2d3a95fe@tatanka.kerneis.info>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 8bit
X-Miltered: at concorde with ID 47C7DC1E.000 by Joe's j-chkmail (http://j-chkmail . ensmp . fr)!
X-Spam: no; 0.00; berke:01 durak:01 berke:01 durak:01 marshalling:01 marshalling:01 combinator:01 combinators:01 parsers:01 popl:01 contrib:01 recursive:01 type-safe:01 serializer:01 mutable:01 

Gabriel Kerneis a écrit :
> Hello,
> 
> Le Thu, 28 Feb 2008 20:14:27 -0500 (EST), Brian Hurt <bhurt@spnz.org> a
> écrit :
>> So, mistake number one: either use the data, and structure your data
>> (at that layer) to take advantage of it, or don't use a database.
>> [...]
>> So that's mistake number two: you're communicating between different 
>> versions of the program with an ill-defined (at best) and not 
>> generic protocol/file format.
> 
> Right. But imagine you're communicating with yourself (saving and
> restoring data). And you need to retrieve the data *efficiently*.
> Converting from a generic file format is not efficient - not if you have
> to retrieve the data 100 times per second (imagine a CMS on a popular
> website). It would be faster to use an internal file format. And, of
> course, keeping a backup in a generic file format.
> 
> But now, here is the big deal: when your internal data structure
> changes (and it might not even be under your control, imagine you're
> using a third-party library), you have to convert your generic backup
> to the newer internal format. And if you have, say, an awful lot of
> backups, it might take soooo long... Of course, it's only once in a
> while, but when this happens, how do you deal with it? Remember the
> efficiency is the key point, here.
> 
> I don't think there is a generic solution to this problem. But I'm just
> pointing out the underlying requirements, in case some would have one.
> 
> Regards,

I had exactly those kinds of issues during my work at the EDOS project.
We were basically treating the metadata for every dat of every component
of every architecture of every Debian distribution.

I started using SQL.  I first tried MySQL then PostgreSQL.
Fill performance was bad.  (And didn't get fast enough when Jaap added
prepared statements for postgresql).
Query performance was worse, because we were interested in things like
the transitive dependency closure, and SQL is
well-known to not be suitable for such transitive properties.  Maybe I
should have tried Sqlite but I didn't know it at the time.

I then switched to marshalling.  Boy, that was fast!  Of course I got
bitten very hard by segmentation faults...  sorry, Xavier, for bothering
you about such stupidity.  And yes, I had a version number in my datastructures
but, no, it was not automatically updated.

Also, adding a field was really, really painful.  Marshalling could work when
your data structures are stable, but during development, it is painful,
especially if you can't use a toy data set for development.

I then built an I/O combinator library.  Of course, combinators already
exist for defining parsers and printers, but I thought that it would be
great to combine them both.  (Such an idea has been presented at POPL
under the term of "picklers" if I recall correctly.)

Basically you describe your types using "literates" which can then be used
for reading and writing, as in

   type my_litterate = io_pair (io_list io_int) (io_hasthbl io_int io_string)

Anyway, that led to the "IO" module available here

   http://caml.inria.fr/cgi-bin/hump.en.cgi?contrib=537

and a better version still lives in the EDOS codebase, which is now
maintained by Jaap Boender.  I think Jaap made a GODI package for it.

IO can use multiple backends (binary, ASCII) and that was quite useful.
Performance is reasonably good.  One problem is with records and sum types -
you have to give pattern-matching functions to define readers/writers for them.
However, you can decide how you handle missing values or extra fields.

Martin Jambon's JSON wheel could also be used here, but I'm waiting for
the preprocessor situation to stabilize a little bit...

One major drawback of IO is that it does not handle sharing or recursive
structures.  And I know of no efficient, type-safe way of handling those,
especially if you don't want the serializer to add extra sharing (for instance
if you have distinct mutable records with the same contents)

I am still thinking about these problems.  For instance, I have been recently
working on Wikipedia revision history (only the Turkish (15 GB) and French ones
(200 GB), the English one is too big).  I needed maximal efficiency so I used Marshal, which was pretty
fast.  Of course I added a small version header.

I noticed that when you Marshal a closure, the module puts a MD5 of
presumably the whole program in the output (which costs some bytes but that's
another problem.)

It would be nice to have a mechanism for marshalling a value with an MD5 of its
types, and unmarshalling only then the MD5 matches.  In non-malicious environments,
that is, in environments where no-one fakes MD5s, that would be quite safe.

Of course it won't be possible to marshal polymorphic types.

So, I'm asking the experts: is it possible to have a very small extension that would allow this?
We could have the following restrictions and it still would be very useful:

   - the type t must be defined in a module M, which must live in a separate compilation unit
     (i.e. have an available cmi)
   - the type t must be ground
   - the type t must be named in an interface
   - it must be explicitly named when marshalling/unmarshalling
   - this is an ugly hack that only works in Marshal

Something along the lines of :

   safer_output_to_channel stdout (z : Interface.t) []

Trying

   safer_output_to_channel stdout z []

Would yield

   safer_output_to_channel stdout z []
                                  ^
Error: value must be explicitly annotated with a type

safer_output_to_channel stdout (33 : int) []
Error: Type needs to be defined in an external interface to be Marshallable.

safer_output_to_channel stdout (bar : Foo.t) []
Error: Module Foo.t has no interface

safer_output_to_channel stdout (bar : int Foo.t) []
Error: Type Foo.t is not ground

-- 
Berke DURAK