From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on yquem.inria.fr X-Spam-Level: X-Spam-Status: No, score=0.6 required=5.0 tests=AWL,HTML_10_20,HTML_MESSAGE autolearn=disabled version=3.1.3 X-Original-To: caml-list@yquem.inria.fr Delivered-To: caml-list@yquem.inria.fr Received: from mail2-relais-roc.national.inria.fr (mail2-relais-roc.national.inria.fr [192.134.164.83]) by yquem.inria.fr (Postfix) with ESMTP id 1E9DBBBAF for ; Sat, 31 May 2008 18:54:58 +0200 (CEST) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: Ag8EAEwfQUjAXQImiGdsb2JhbACCNjePRQEBAQ8gmmI X-IronPort-AV: E=Sophos;i="4.27,571,1204498800"; d="scan'208";a="11378980" Received: from discorde.inria.fr ([192.93.2.38]) by mail2-smtp-roc.national.inria.fr with ESMTP; 31 May 2008 18:54:57 +0200 Received: from mail3-relais-sop.national.inria.fr (mail3-relais-sop.national.inria.fr [192.134.164.104]) by discorde.inria.fr (8.13.6/8.13.6) with ESMTP id m4VGsvZv007149 (version=TLSv1/SSLv3 cipher=RC4-SHA bits=128 verify=OK) for ; Sat, 31 May 2008 18:54:57 +0200 X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AmoYAMMfQUhA6bjuZGdsb2JhbACCNjePOBoCBQQJEQOaYQ X-IronPort-AV: E=Sophos;i="4.27,571,1204498800"; d="scan'208";a="13307966" Received: from wr-out-0506.google.com ([64.233.184.238]) by mail3-smtp-sop.national.inria.fr with ESMTP; 31 May 2008 18:54:56 +0200 Received: by wr-out-0506.google.com with SMTP id c57so130558wra.9 for ; Sat, 31 May 2008 09:54:53 -0700 (PDT) Received: by 10.142.140.14 with SMTP id n14mr2716814wfd.192.1212252892255; Sat, 31 May 2008 09:54:52 -0700 (PDT) Received: by 10.143.165.3 with HTTP; Sat, 31 May 2008 09:54:52 -0700 (PDT) Message-ID: <28fa90930805310954w3089478bqfd8c3f821fff207e@mail.gmail.com> Date: Sat, 31 May 2008 09:54:52 -0700 From: "Luca de Alfaro" To: "Berke Durak" Subject: Re: [Caml-list] picking / marshaling to strings in ocaml-revision-stable way Cc: "Jacques Garrigue" , caml-list@inria.fr In-Reply-To: MIME-Version: 1.0 Content-Type: multipart/alternative; boundary="----=_Part_2423_12423053.1212252892257" References: <28fa90930805302343n18b98b17t72a22ea82539fc6f@mail.gmail.com> <20080531.174314.107716495.garrigue@math.nagoya-u.ac.jp> X-Miltered: at discorde with ID 484182E1.000 by Joe's j-chkmail (http://j-chkmail . ensmp . fr)! X-Spam: no; 0.00; marshaling:01 marshaling:01 wikipedia:01 ocaml:01 datatypes:01 ocamldebug:01 berke:01 durak:01 berke:01 durak:01 type-safe:01 segfault:01 segfault:01 hand-write:01 dependencies:01 ------=_Part_2423_12423053.1212252892257 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Content-Disposition: inline Thanks for this insight... I imagined the lack of robustness of Marshaling, but without all the details you mentioned!... actually, I DO desperately need speed, as I am processing TB's of Wikipedia data, but precisely because the datasets are so large, I cannot afford having to recompute / convert them often, and so I want a robust format. Furthermore, I think the bottleneck for me is anyway the speed of mysql and the disk, not really the small amount of time that natively compiled Ocaml would take for the conversion (I have anyway to do more complex computation that converting a few lists and datatypes to ascii, unfortunately). Moreover, a plaintext format greatly helps debugging; it also helps that I can read the same data with other programming languages. Speaking of debugging, and said in passing, I cannot say enough how much I LOVE the ability of ocamldebug of executing code backwards. It is such a revelation. You simply go to the error, then back off a bit to see how you got there. But, this is a topic for another thread. Many thanks, Luca On Sat, May 31, 2008 at 2:38 AM, Berke Durak wrote: > I second Luca's suggestion to use Sexplib. At the very least, use a > plaintext format. > Don't use Marshal for long-term storage of values. Avoid it if you > can. Been there, done that. > Why? > > (1) Not type-safe. Translation: your program *wil segfault* and you > won't know why. > (2) Not human-readable nor editable. > (3) Not future-proof. What happens if you change your type > definition? Your program > will segfault. So you'll have to migrate your data. But how? You'll > have to find > the exact revision used to generate the binary data. Good luck with > that. Did you put > a revision number in your data? Are you sure it was up-to-date? Then > you'll have to hand-write a converter that uses type declarations from > the old and the new modules. > I hope your dependencies are not too complex. Not fun *at all*. > > However, there are some situations where Marshal is appropriate : > > (1) Your data is not acyclic, contains closures, or needs sharing to > be compact enough. Sexplib doesn't handle these. > (2) The data won't live long anyway. As in: you're doing IPC between > known versions of Ocaml programs. > (3) You desperately need speed. As in: you're processing 200GB of > Wikipedia data. > Then I can understand. > -- > Berke Durak > ------=_Part_2423_12423053.1212252892257 Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Content-Disposition: inline Thanks for this insight... I imagined the lack of robustness of Marshaling, but without all the details you mentioned!... actually, I DO desperately need speed, as I am processing TB's of Wikipedia data, but precisely because the datasets are so large, I cannot afford having to recompute / convert them often, and so I want a robust format. Furthermore, I think the bottleneck for me is anyway the speed of mysql and the disk, not really the small amount of time that natively compiled Ocaml would take for the conversion (I have anyway to do more complex computation that converting a few lists and datatypes to ascii, unfortunately).  Moreover, a plaintext format greatly helps debugging; it also helps that I can read the same data with other programming languages.

Speaking of debugging, and said in passing, I cannot say enough how much I LOVE the ability of ocamldebug of executing code backwards.  It is such a revelation.  You simply go to the error, then back off a bit to see how you got there.  But, this is a topic for another thread.

Many thanks,

Luca


On Sat, May 31, 2008 at 2:38 AM, Berke Durak <berke.durak@gmail.com> wrote:
I second Luca's suggestion to use Sexplib.  At the very least, use a
plaintext format.
Don't use Marshal for long-term storage of values.  Avoid it if you
can.  Been there, done that.
Why?

(1) Not type-safe.  Translation: your program *wil segfault* and you
won't know why.
(2) Not human-readable nor editable.
(3) Not future-proof.  What happens if you change your type
definition?  Your program
will segfault.  So you'll have to migrate your data.  But how?  You'll
have to find
the exact revision used to generate the binary data.  Good luck with
that.  Did you put
a revision number in your data?  Are you sure it was up-to-date?  Then
you'll have to hand-write a converter that uses type declarations from
the old and the new modules.
I hope your dependencies are not too complex.  Not fun *at all*.

However, there are some situations where Marshal is appropriate :

(1) Your data is not acyclic, contains closures, or needs sharing to
be compact enough.  Sexplib doesn't handle these.
(2) The data won't live long anyway.  As in: you're doing IPC between
known versions of Ocaml programs.
(3) You desperately need speed.  As in: you're processing 200GB of
Wikipedia data.
Then I can understand.
--
Berke Durak

------=_Part_2423_12423053.1212252892257--