Re: [Caml-list] Keeping A big data optimization problem functional

caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed

From: Arthur Breitman <arthur.breitman@gmail.com>
To: Francois Berenger <berenger@riken.jp>
Cc: caml-list <caml-list@inria.fr>
Subject: Re: [Caml-list] Keeping A big data optimization problem functional
Date: Mon, 14 Apr 2014 00:07:30 -0600	[thread overview]
Message-ID: <CAAYUt0NLS+w-ZtS0JgnonzvkY4YcsigCq2fGvuf6dvWd5N6sPw@mail.gmail.com> (raw)
In-Reply-To: <534B7587.6090301@riken.jp>

[-- Attachment #1: Type: text/plain, Size: 3669 bytes --]

No but it does have bindings for leveldb which does just what I want. The
question is, how do I hide those dirty db mutations.


On Sun, Apr 13, 2014 at 11:43 PM, Francois Berenger <berenger@riken.jp>wrote:

> And that's when I realize that "Oh! No! We don't have bindings to Berkeley
> DB in OPAM!". That's unfortunate. :(
>
>
> On 2014年04月14日 14:25, Arthur Breitman wrote:
>
>> Hello all,
>>
>> This is my first post to the mailing list, I hope it will be relevant and
>> in line with the posting guidelines. If not, please let me know and pardon
>> my ignorance.
>>
>> I am working on a large problem that I'm attempting to tackle in Ocaml. It
>> requires me to optimize a function over a tree
>>
>> - the tree has on the order of 100,000 to 1,000,000 nodes
>> - each node on the tree contains on the order of 100kB to 1MB of data
>>
>> So total, I'm looking at 10GB to 1TB one data which is typically stored in
>> a database on disk. The tree is loaded in memory using keys that represent
>> entries in the database.
>>
>> Another structure I'm dealing with is a key value store, whose size is on
>> the same order of magnitude as the size of the data above. Initially, this
>> key-value store is empty.
>>
>> I am given an "apply" function that, for a given key-value store and for a
>> given node in the tree produces a new key-value store. Typically by
>> modifying a few keys in the key-value store. The amount of bits changed is
>> on the same order of magnitude as the size of the node. When I reach a
>> leaf
>> of the tree, I can compute the optimality of that leaf from the state of
>> the key-value store. I want to find the best leaf.
>>
>> Here's my problem...
>> If I write the following in a imperative fashion, it seems straightforward
>> enough. I should be doing a DFS over my tree, mutating my key=value store
>> as I go, and saving on disk "backup" information to backtrack through the
>> tree. This doesn't feel very elegant. I've been considering two solutions
>>
>> 1) since there is a one to one mapping between nodes in the tree and
>> key-value stores, I could make my key-value store type to actually be the
>> same as a node in the tree. I would put it in a module along with a
>> reference to the node representing which key-value store is being
>> represented on disk. The "apply" function would be straightforward and
>> lazy, but whenever the keys of a key-value store are to be accessed, the
>> following would happen
>>
>> a) Check if the key-value store on disk is the same as the one requested,
>> if so read the key
>> b) If that's not the case, find the least common ancestor in the tree,
>> migrate the disk cache to that version then down to the one requested.
>> This
>> would require me to have an "undo" function to complement the "apply"
>> function, which is an acceptable compromise.
>>
>> 2) the second solution would be to represent the key-value store as a
>> binary tree. Since each operation changes only a few keys, I wouldn't be
>> making a dumb copy of the entire key-value store at each step, only of the
>> keys that matter. The problem is that though that works great in memory,
>> I'm not sure how to make it work with an on-disk storage, a necessity
>> given
>> the size of it.
>>
>> Thanks,
>> Arthurto
>>
>>
>
> --
> Best regards,
> Francois Berenger.
>
> --
> Caml-list mailing list.  Subscription management and archives:
> https://sympa.inria.fr/sympa/arc/caml-list
> Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
> Bug reports: http://caml.inria.fr/bin/caml-bugs
>



-- 
Arthur Breitman

[-- Attachment #2: Type: text/html, Size: 4656 bytes --]

next prev parent reply	other threads:[~2014-04-14  6:08 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-04-14  5:25 Arthur Breitman
2014-04-14  5:43 ` Francois Berenger
2014-04-14  6:07   ` Arthur Breitman [this message]
2014-04-14  8:12 ` Goswin von Brederlow
2014-04-14 19:32   ` Arthur Breitman
2014-04-15  9:05     ` Goswin von Brederlow
2014-04-15 13:16       ` Arthur Breitman
2014-04-15 13:50     ` Thomas Gazagnaire
2014-04-16 11:02       ` Arthur Breitman
2014-04-14 20:38 ` Richard W.M. Jones

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAAYUt0NLS+w-ZtS0JgnonzvkY4YcsigCq2fGvuf6dvWd5N6sPw@mail.gmail.com \
    --to=arthur.breitman@gmail.com \
    --cc=berenger@riken.jp \
    --cc=caml-list@inria.fr \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).