From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Original-To: caml-list@sympa.inria.fr Delivered-To: caml-list@sympa.inria.fr Received: from mail2-relais-roc.national.inria.fr (mail2-relais-roc.national.inria.fr [192.134.164.83]) by sympa.inria.fr (Postfix) with ESMTPS id CADD97F720 for ; Mon, 14 Apr 2014 08:08:04 +0200 (CEST) Received-SPF: None (mail2-smtp-roc.national.inria.fr: no sender authenticity information available from domain of arthur.breitman@gmail.com) identity=pra; client-ip=209.85.215.53; receiver=mail2-smtp-roc.national.inria.fr; envelope-from="arthur.breitman@gmail.com"; x-sender="arthur.breitman@gmail.com"; x-conformance=sidf_compatible Received-SPF: Pass (mail2-smtp-roc.national.inria.fr: domain of arthur.breitman@gmail.com designates 209.85.215.53 as permitted sender) identity=mailfrom; client-ip=209.85.215.53; receiver=mail2-smtp-roc.national.inria.fr; envelope-from="arthur.breitman@gmail.com"; x-sender="arthur.breitman@gmail.com"; x-conformance=sidf_compatible; x-record-type="v=spf1" Received-SPF: None (mail2-smtp-roc.national.inria.fr: no sender authenticity information available from domain of postmaster@mail-la0-f53.google.com) identity=helo; client-ip=209.85.215.53; receiver=mail2-smtp-roc.national.inria.fr; envelope-from="arthur.breitman@gmail.com"; x-sender="postmaster@mail-la0-f53.google.com"; x-conformance=sidf_compatible X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: ApADABJ7S1PRVdc1Y2dsb2JhbABagwY7V4MQqQWODIh0gRYIFgMWDQkJFSeCJQEBAQMBIx0BGxILAQMBCwYFCw0NHQICIgERAQUBChIGEwgKCYdMAQMJCA2cQowOUYMOlxkKGScDCmSFfxEBBQyJOoQ9IkUEB4JvgUkEiVyLGINtgTaPHhgpgm6CEB6BLQ X-IPAS-Result: ApADABJ7S1PRVdc1Y2dsb2JhbABagwY7V4MQqQWODIh0gRYIFgMWDQkJFSeCJQEBAQMBIx0BGxILAQMBCwYFCw0NHQICIgERAQUBChIGEwgKCYdMAQMJCA2cQowOUYMOlxkKGScDCmSFfxEBBQyJOoQ9IkUEB4JvgUkEiVyLGINtgTaPHhgpgm6CEB6BLQ X-IronPort-AV: E=Sophos;i="4.97,855,1389740400"; d="scan'208";a="68062258" Received: from mail-la0-f53.google.com ([209.85.215.53]) by mail2-smtp-roc.national.inria.fr with ESMTP; 14 Apr 2014 08:07:50 +0200 Received: by mail-la0-f53.google.com with SMTP id b8so5362021lan.12 for ; Sun, 13 Apr 2014 23:07:50 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :cc:content-type; bh=5BEycZwf/gm6u8qjhNN5H3IxR5tzZuFoMs8scsNd8io=; b=nmeMQkQlZJrJTxTVMkOJOWCoHpDJQ2dvXIpZPMNi9/MluUqTN6kW8KAxwGFsurZZCt GCPs0WHSIl/msmsW+utQrItWhPqMEk6MjjIQigk/+J+OBZyNNlrXjgdWwSg9PchXchVR 0HGYgwDaFWC9yYdBBMiRnDepw4fj7G80CtILumrBfbBHJDAoNMocDVEA33P9qGskXd0p p8GvmANUhX4ePYOJaJPdUOeEzFsBvBF2hfqLwaJQMRa2WqMsWEPs8ypzhBbWzk+IOePO dTSLGuIXVQZsZ9BgfMcr73E3KNSUIUYn3ldY3JvkTBbcIf5fsQHg1MSQdausFyn3kNUN rsGg== X-Received: by 10.152.7.97 with SMTP id i1mr429946laa.36.1397455670121; Sun, 13 Apr 2014 23:07:50 -0700 (PDT) MIME-Version: 1.0 Received: by 10.114.158.130 with HTTP; Sun, 13 Apr 2014 23:07:30 -0700 (PDT) In-Reply-To: <534B7587.6090301@riken.jp> References: <534B7587.6090301@riken.jp> From: Arthur Breitman Date: Mon, 14 Apr 2014 00:07:30 -0600 Message-ID: To: Francois Berenger Cc: caml-list Content-Type: multipart/alternative; boundary=001a11c27b7ace99ec04f6fa80ce Subject: Re: [Caml-list] Keeping A big data optimization problem functional --001a11c27b7ace99ec04f6fa80ce Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable No but it does have bindings for leveldb which does just what I want. The question is, how do I hide those dirty db mutations. On Sun, Apr 13, 2014 at 11:43 PM, Francois Berenger wrot= e: > And that's when I realize that "Oh! No! We don't have bindings to Berkeley > DB in OPAM!". That's unfortunate. :( > > > On 2014=E5=B9=B404=E6=9C=8814=E6=97=A5 14:25, Arthur Breitman wrote: > >> Hello all, >> >> This is my first post to the mailing list, I hope it will be relevant and >> in line with the posting guidelines. If not, please let me know and pard= on >> my ignorance. >> >> I am working on a large problem that I'm attempting to tackle in Ocaml. = It >> requires me to optimize a function over a tree >> >> - the tree has on the order of 100,000 to 1,000,000 nodes >> - each node on the tree contains on the order of 100kB to 1MB of data >> >> So total, I'm looking at 10GB to 1TB one data which is typically stored = in >> a database on disk. The tree is loaded in memory using keys that represe= nt >> entries in the database. >> >> Another structure I'm dealing with is a key value store, whose size is on >> the same order of magnitude as the size of the data above. Initially, th= is >> key-value store is empty. >> >> I am given an "apply" function that, for a given key-value store and for= a >> given node in the tree produces a new key-value store. Typically by >> modifying a few keys in the key-value store. The amount of bits changed = is >> on the same order of magnitude as the size of the node. When I reach a >> leaf >> of the tree, I can compute the optimality of that leaf from the state of >> the key-value store. I want to find the best leaf. >> >> Here's my problem... >> If I write the following in a imperative fashion, it seems straightforwa= rd >> enough. I should be doing a DFS over my tree, mutating my key=3Dvalue st= ore >> as I go, and saving on disk "backup" information to backtrack through the >> tree. This doesn't feel very elegant. I've been considering two solutions >> >> 1) since there is a one to one mapping between nodes in the tree and >> key-value stores, I could make my key-value store type to actually be the >> same as a node in the tree. I would put it in a module along with a >> reference to the node representing which key-value store is being >> represented on disk. The "apply" function would be straightforward and >> lazy, but whenever the keys of a key-value store are to be accessed, the >> following would happen >> >> a) Check if the key-value store on disk is the same as the one requested, >> if so read the key >> b) If that's not the case, find the least common ancestor in the tree, >> migrate the disk cache to that version then down to the one requested. >> This >> would require me to have an "undo" function to complement the "apply" >> function, which is an acceptable compromise. >> >> 2) the second solution would be to represent the key-value store as a >> binary tree. Since each operation changes only a few keys, I wouldn't be >> making a dumb copy of the entire key-value store at each step, only of t= he >> keys that matter. The problem is that though that works great in memory, >> I'm not sure how to make it work with an on-disk storage, a necessity >> given >> the size of it. >> >> Thanks, >> Arthurto >> >> > > -- > Best regards, > Francois Berenger. > > -- > Caml-list mailing list. Subscription management and archives: > https://sympa.inria.fr/sympa/arc/caml-list > Beginner's list: http://groups.yahoo.com/group/ocaml_beginners > Bug reports: http://caml.inria.fr/bin/caml-bugs > --=20 Arthur Breitman --001a11c27b7ace99ec04f6fa80ce Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable
No but it does have bindings for leveldb which does just w= hat I want. The question is, how do I hide those dirty db mutations.
<= div class=3D"gmail_extra">

On Sun, Apr 13= , 2014 at 11:43 PM, Francois Berenger <berenger@riken.jp> wr= ote:
And that's when I realize that "Oh!= No! We don't have bindings to Berkeley DB in OPAM!". That's u= nfortunate. :(


On 2014=E5=B9=B404=E6=9C=8814=E6=97=A5 14:25, Arthur Breitman wrote:
Hello all,

This is my first post to the mailing list, I hope it will be relevant and in line with the posting guidelines. If not, please let me know and pardon<= br> my ignorance.

I am working on a large problem that I'm attempting to tackle in Ocaml.= It
requires me to optimize a function over a tree

- the tree has on the order of 100,000 to 1,000,000 nodes
- each node on the tree contains on the order of 100kB to 1MB of data

So total, I'm looking at 10GB to 1TB one data which is typically stored= in
a database on disk. The tree is loaded in memory using keys that represent<= br> entries in the database.

Another structure I'm dealing with is a key value store, whose size is = on
the same order of magnitude as the size of the data above. Initially, this<= br> key-value store is empty.

I am given an "apply" function that, for a given key-value store = and for a
given node in the tree produces a new key-value store. Typically by
modifying a few keys in the key-value store. The amount of bits changed is<= br> on the same order of magnitude as the size of the node. When I reach a leaf=
of the tree, I can compute the optimality of that leaf from the state of
the key-value store. I want to find the best leaf.

Here's my problem...
If I write the following in a imperative fashion, it seems straightforward<= br> enough. I should be doing a DFS over my tree, mutating my key=3Dvalue store=
as I go, and saving on disk "backup" information to backtrack thr= ough the
tree. This doesn't feel very elegant. I've been considering two sol= utions

1) since there is a one to one mapping between nodes in the tree and
key-value stores, I could make my key-value store type to actually be the same as a node in the tree. I would put it in a module along with a
reference to the node representing which key-value store is being
represented on disk. The "apply" function would be straightforwar= d and
lazy, but whenever the keys of a key-value store are to be accessed, the
following would happen

a) Check if the key-value store on disk is the same as the one requested, if so read the key
b) If that's not the case, find the least common ancestor in the tree,<= br> migrate the disk cache to that version then down to the one requested. This=
would require me to have an "undo" function to complement the &qu= ot;apply"
function, which is an acceptable compromise.

2) the second solution would be to represent the key-value store as a
binary tree. Since each operation changes only a few keys, I wouldn't b= e
making a dumb copy of the entire key-value store at each step, only of the<= br> keys that matter. The problem is that though that works great in memory,
I'm not sure how to make it work with an on-disk storage, a necessity g= iven
the size of it.

Thanks,
Arthurto



--
Best regards,
Francois Berenger.

--
Caml-list mailing list. =C2=A0Subscription management and archives:
ht= tps://sympa.inria.fr/sympa/arc/caml-list
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners<= /a>
Bug reports:
http://caml.inria.fr/bin/caml-bugs



--
Arthur Breitman
--001a11c27b7ace99ec04f6fa80ce--