From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on yquem.inria.fr X-Spam-Level: X-Spam-Status: No, score=0.0 required=5.0 tests=AWL autolearn=disabled version=3.1.3 X-Original-To: caml-list@yquem.inria.fr Delivered-To: caml-list@yquem.inria.fr Received: from mail3-relais-sop.national.inria.fr (mail3-relais-sop.national.inria.fr [192.134.164.104]) by yquem.inria.fr (Postfix) with ESMTP id D0DD6BBCA for ; Fri, 29 Feb 2008 11:19:12 +0100 (CET) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AgAAAG9qx0fAXQInh2dsb2JhbACQbwEBAQgKKYENmzY X-IronPort-AV: E=Sophos;i="4.25,425,1199660400"; d="scan'208";a="9741561" Received: from concorde.inria.fr ([192.93.2.39]) by mail3-smtp-sop.national.inria.fr with ESMTP; 29 Feb 2008 11:19:12 +0100 Received: from mail1-relais-roc.national.inria.fr (mail1-relais-roc.national.inria.fr [192.134.164.82]) by concorde.inria.fr (8.13.6/8.13.6) with ESMTP id m1TAJAtg002442 (version=TLSv1/SSLv3 cipher=RC4-SHA bits=128 verify=OK) for ; Fri, 29 Feb 2008 11:19:12 +0100 X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: Ao8CAB5rx0fBL1AZYmdsb2JhbACQYwIVBAYICBmBDZsp X-IronPort-AV: E=Sophos;i="4.25,425,1199660400"; d="scan'208";a="8773265" Received: from gw.exalead.com (HELO exalead.com) ([193.47.80.25]) by mail1-smtp-roc.national.inria.fr with ESMTP; 29 Feb 2008 11:19:11 +0100 Received: from [192.168.204.148] (madpc064.exalead.com [192.168.204.148]) (authenticated bits=0) by exalead.com (8.14.0/8.14.0) with ESMTP id m1TAJ9cm019183; Fri, 29 Feb 2008 11:19:09 +0100 Message-ID: <47C7DC1D.9030600@exalead.com> Date: Fri, 29 Feb 2008 11:19:09 +0100 From: Berke Durak User-Agent: Thunderbird 1.5.0.10 (X11/20070221) MIME-Version: 1.0 To: Gabriel Kerneis , Caml-list List Subject: Re: [Caml-list] Long-term storage of values References: <191751.36007.qm@web54607.mail.re2.yahoo.com> <20080229084000.2d3a95fe@tatanka.kerneis.info> In-Reply-To: <20080229084000.2d3a95fe@tatanka.kerneis.info> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 8bit X-Miltered: at concorde with ID 47C7DC1E.000 by Joe's j-chkmail (http://j-chkmail . ensmp . fr)! X-Spam: no; 0.00; berke:01 durak:01 berke:01 durak:01 marshalling:01 marshalling:01 combinator:01 combinators:01 parsers:01 popl:01 contrib:01 recursive:01 type-safe:01 serializer:01 mutable:01 Gabriel Kerneis a écrit : > Hello, > > Le Thu, 28 Feb 2008 20:14:27 -0500 (EST), Brian Hurt a > écrit : >> So, mistake number one: either use the data, and structure your data >> (at that layer) to take advantage of it, or don't use a database. >> [...] >> So that's mistake number two: you're communicating between different >> versions of the program with an ill-defined (at best) and not >> generic protocol/file format. > > Right. But imagine you're communicating with yourself (saving and > restoring data). And you need to retrieve the data *efficiently*. > Converting from a generic file format is not efficient - not if you have > to retrieve the data 100 times per second (imagine a CMS on a popular > website). It would be faster to use an internal file format. And, of > course, keeping a backup in a generic file format. > > But now, here is the big deal: when your internal data structure > changes (and it might not even be under your control, imagine you're > using a third-party library), you have to convert your generic backup > to the newer internal format. And if you have, say, an awful lot of > backups, it might take soooo long... Of course, it's only once in a > while, but when this happens, how do you deal with it? Remember the > efficiency is the key point, here. > > I don't think there is a generic solution to this problem. But I'm just > pointing out the underlying requirements, in case some would have one. > > Regards, I had exactly those kinds of issues during my work at the EDOS project. We were basically treating the metadata for every dat of every component of every architecture of every Debian distribution. I started using SQL. I first tried MySQL then PostgreSQL. Fill performance was bad. (And didn't get fast enough when Jaap added prepared statements for postgresql). Query performance was worse, because we were interested in things like the transitive dependency closure, and SQL is well-known to not be suitable for such transitive properties. Maybe I should have tried Sqlite but I didn't know it at the time. I then switched to marshalling. Boy, that was fast! Of course I got bitten very hard by segmentation faults... sorry, Xavier, for bothering you about such stupidity. And yes, I had a version number in my datastructures but, no, it was not automatically updated. Also, adding a field was really, really painful. Marshalling could work when your data structures are stable, but during development, it is painful, especially if you can't use a toy data set for development. I then built an I/O combinator library. Of course, combinators already exist for defining parsers and printers, but I thought that it would be great to combine them both. (Such an idea has been presented at POPL under the term of "picklers" if I recall correctly.) Basically you describe your types using "literates" which can then be used for reading and writing, as in type my_litterate = io_pair (io_list io_int) (io_hasthbl io_int io_string) Anyway, that led to the "IO" module available here http://caml.inria.fr/cgi-bin/hump.en.cgi?contrib=537 and a better version still lives in the EDOS codebase, which is now maintained by Jaap Boender. I think Jaap made a GODI package for it. IO can use multiple backends (binary, ASCII) and that was quite useful. Performance is reasonably good. One problem is with records and sum types - you have to give pattern-matching functions to define readers/writers for them. However, you can decide how you handle missing values or extra fields. Martin Jambon's JSON wheel could also be used here, but I'm waiting for the preprocessor situation to stabilize a little bit... One major drawback of IO is that it does not handle sharing or recursive structures. And I know of no efficient, type-safe way of handling those, especially if you don't want the serializer to add extra sharing (for instance if you have distinct mutable records with the same contents) I am still thinking about these problems. For instance, I have been recently working on Wikipedia revision history (only the Turkish (15 GB) and French ones (200 GB), the English one is too big). I needed maximal efficiency so I used Marshal, which was pretty fast. Of course I added a small version header. I noticed that when you Marshal a closure, the module puts a MD5 of presumably the whole program in the output (which costs some bytes but that's another problem.) It would be nice to have a mechanism for marshalling a value with an MD5 of its types, and unmarshalling only then the MD5 matches. In non-malicious environments, that is, in environments where no-one fakes MD5s, that would be quite safe. Of course it won't be possible to marshal polymorphic types. So, I'm asking the experts: is it possible to have a very small extension that would allow this? We could have the following restrictions and it still would be very useful: - the type t must be defined in a module M, which must live in a separate compilation unit (i.e. have an available cmi) - the type t must be ground - the type t must be named in an interface - it must be explicitly named when marshalling/unmarshalling - this is an ugly hack that only works in Marshal Something along the lines of : safer_output_to_channel stdout (z : Interface.t) [] Trying safer_output_to_channel stdout z [] Would yield safer_output_to_channel stdout z [] ^ Error: value must be explicitly annotated with a type safer_output_to_channel stdout (33 : int) [] Error: Type needs to be defined in an external interface to be Marshallable. safer_output_to_channel stdout (bar : Foo.t) [] Error: Module Foo.t has no interface safer_output_to_channel stdout (bar : int Foo.t) [] Error: Type Foo.t is not ground -- Berke DURAK