From: Yaron Minsky
To: Caml-list List
Subject: Re: [Caml-list] picking / marshaling to strings in ocaml-revision-stable way
Date: Sat, 31 May 2008 13:06:37 -0400

If you're willing to sacrifice readability for speed and compactness, you might want to consider Jane Street's bin-prot library as well...

Yaron Minsky

On May 31, 2008, at 12:54 PM, Luca de Alfaro <luca@dealfaro.org> wrote:

Thanks for this insight... I imagined the lack of robustness of Marshaling, but without all the details you mentioned!... Actually, I DO desperately need speed, as I am processing TBs of Wikipedia data, but precisely because the datasets are so large, I cannot afford to recompute / convert them often, and so I want a robust format. Furthermore, I think the bottleneck for me is the speed of MySQL and the disk anyway, not really the small amount of time that natively compiled OCaml would take for the conversion (I have to do more complex computation than converting a few lists and datatypes to ASCII anyway, unfortunately).  Moreover, a plaintext format greatly helps debugging; it also helps that I can read the same data with other programming languages.
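
To make the plaintext idea concrete, here is a minimal sketch using only the OCaml standard library. The `page` type and the tab-separated layout are hypothetical, chosen for illustration; they are not Sexplib's format or anything described in this thread. The sketch assumes titles contain no tab or newline characters.

```ocaml
(* Hypothetical record type, for illustration only. *)
type page = { title : string; edits : int list }

(* One record per line: title, a tab, then comma-separated edit ids.
   Assumes the title contains no tab or newline. *)
let to_line (p : page) : string =
  p.title ^ "\t" ^ String.concat "," (List.map string_of_int p.edits)

(* Parse a line back; malformed input yields None instead of a crash. *)
let of_line (line : string) : page option =
  match String.split_on_char '\t' line with
  | [title; ""] -> Some { title; edits = [] }
  | [title; edits] ->
    (try
       let ids = List.map int_of_string (String.split_on_char ',' edits) in
       Some { title; edits = ids }
     with Failure _ -> None)
  | _ -> None
```

A line in this layout can be inspected with a text editor or read from another language by splitting on the tab and the commas, and bad input surfaces as `None` rather than as memory corruption.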

Speaking of debugging, and said in passing, I cannot say enough how much I LOVE ocamldebug's ability to execute code backwards.  It is such a revelation.  You simply go to the error, then back off a bit to see how you got there.  But this is a topic for another thread.

Many thanks,

Luca


On Sat, May 31, 2008 at 2:38 AM, Berke Durak <berke.durak@gmail.com> wrote:
I second Luca's suggestion to use Sexplib.  At the very least, use a plaintext format.
Don't use Marshal for long-term storage of values.  Avoid it if you can.  Been there, done that.
Why?

(1) Not type-safe.  Translation: your program *will segfault* and you won't know why.
(2) Neither human-readable nor editable.
(3) Not future-proof.  What happens if you change your type definition?  Your program will segfault.  So you'll have to migrate your data.  But how?  You'll have to find the exact revision used to generate the binary data.  Good luck with that.  Did you put a revision number in your data?  Are you sure it was up-to-date?  Then you'll have to hand-write a converter that uses type declarations from the old and the new modules.
I hope your dependencies are not too complex.  Not fun *at all*.
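
If Marshal is used anyway, the revision-number question in point (3) can be handled with a small text header in front of the binary payload. The following is only a sketch under stated assumptions: the `page` type, the `magic` string and the `version` number are hypothetical, and the header check catches mismatched versions only if the number is actually bumped whenever the type changes.

```ocaml
(* Hypothetical record type and header values, for illustration only. *)
type page = { title : string; edits : int list }

let magic = "PAGE"
let version = 1   (* must be bumped by hand whenever [page] changes *)

(* Prefix the marshaled bytes with a one-line, human-readable header. *)
let save (p : page) : string =
  Printf.sprintf "%s %d\n" magic version ^ Marshal.to_string p []

(* Refuse to unmarshal unless the header matches exactly. *)
let load (s : string) : (page, string) result =
  match String.index_opt s '\n' with
  | None -> Error "missing header"
  | Some nl ->
    let header = String.sub s 0 nl in
    let expected = Printf.sprintf "%s %d" magic version in
    if header <> expected then
      Error (Printf.sprintf "expected header %S, found %S" expected header)
    else
      (* Still unchecked by the type system: the annotation is a promise. *)
      Ok (Marshal.from_string s (nl + 1) : page)
```

This does not make Marshal type-safe. It only turns the most common failure, reading data written by a different revision, into an explicit error instead of a crash.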

However, there are some situations where Marshal is appropriate :

(1) Your data is not acyclic, contains closures, or needs sharing to be compact enough.  Sexplib doesn't handle these.
(2) The data won't live long anyway.  As in: you're doing IPC between known versions of OCaml programs.
(3) You desperately need speed.  As in: you're processing 200GB of Wikipedia data.
Then I can understand.
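
Case (1) can be illustrated with a short sketch using only the standard library's Marshal module. The names `cyclic`, `roundtrip`, `take` and `big` are made up for the example. With the default flags, Marshal tracks sharing, so a cyclic value survives a round trip and a string referenced twice is written once.

```ocaml
(* A cyclic list: 1, 2, 1, 2, ...  A naive tree-walking printer would loop on it. *)
let rec cyclic = 1 :: 2 :: cyclic

(* Default flags (the empty list) preserve sharing and cycles. *)
let roundtrip (v : int list) : int list =
  Marshal.from_string (Marshal.to_string v []) 0

(* First n elements, so that the cyclic copy can be inspected safely. *)
let rec take n l =
  if n = 0 then []
  else match l with
    | [] -> []
    | x :: tl -> x :: take (n - 1) tl

let copy = roundtrip cyclic
let prefix = take 5 copy

(* Sharing: the same 1000-byte string referenced twice. *)
let big = String.make 1000 'x'
let shared = [big; big]
let size_with_sharing = String.length (Marshal.to_string shared [])
let size_without_sharing =
  String.length (Marshal.to_string shared [Marshal.No_sharing])
```

Here `prefix` shows the copy is still cyclic, and `size_with_sharing` comes out smaller than `size_without_sharing` because the shared string is stored once. Closures are left out of the sketch: they need the `Marshal.Closures` flag and can only be read back by the same program binary.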
--
Berke Durak

_______________________________________________
Caml-list mailing list. Subscription management:
http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list
Archives: http://caml.inria.fr
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
Bug reports: http://caml.inria.fr/bin/caml-bugs