caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed
From: Goswin von Brederlow <goswin-v-b@web.de>
To: caml-list@inria.fr
Subject: Re: [Caml-list] Keeping A big data optimization problem functional
Date: Mon, 14 Apr 2014 10:12:14 +0200	[thread overview]
Message-ID: <20140414081214.GD25532@frosties> (raw)
In-Reply-To: <CAAYUt0P4AzcYPbNLYsHRpCnBGa=oJT3MRfKwZv-wE4vv-JXNkw@mail.gmail.com>

On Sun, Apr 13, 2014 at 11:25:18PM -0600, Arthur Breitman wrote:
> Hello all,
> 
> This is my first post to the mailing list, I hope it will be relevant and
> in line with the posting guidelines. If not, please let me know and pardon
> my ignorance.
> 
> I am working on a large problem that I'm attempting to tackle in Ocaml. It
> requires me to optimize a function over a tree
> 
> - the tree has on the order of 100,000 to 1,000,000 nodes
> - each node on the tree contains on the order of 100kB to 1MB of data
> 
> So total, I'm looking at 10GB to 1TB one data which is typically stored in
> a database on disk. The tree is loaded in memory using keys that represent
> entries in the database.
> 
> Another structure I'm dealing with is a key value store, whose size is on
> the same order of magnitude as the size of the data above. Initially, this
> key-value store is empty.

You probably don't expect to keep 1TB data in ram. Does the key-value
store just reference values in the DB?
 
> I am given an "apply" function that, for a given key-value store and for a
> given node in the tree produces a new key-value store. Typically by
> modifying a few keys in the key-value store. The amount of bits changed is
> on the same order of magnitude as the size of the node. When I reach a leaf
> of the tree, I can compute the optimality of that leaf from the state of
> the key-value store. I want to find the best leaf.

Not sure what you mean there. Maybe you could give an example with a
small data set (maybe 10 keys).
 
> Here's my problem...
> If I write the following in a imperative fashion, it seems straightforward
> enough. I should be doing a DFS over my tree, mutating my key=value store
> as I go, and saving on disk "backup" information to backtrack through the
> tree. This doesn't feel very elegant. I've been considering two solutions
> 
> 1) since there is a one to one mapping between nodes in the tree and
> key-value stores, I could make my key-value store type to actually be the
> same as a node in the tree. I would put it in a module along with a
> reference to the node representing which key-value store is being
> represented on disk. The "apply" function would be straightforward and
> lazy, but whenever the keys of a key-value store are to be accessed, the
> following would happen
> 
> a) Check if the key-value store on disk is the same as the one requested,
> if so read the key
> b) If that's not the case, find the least common ancestor in the tree,
> migrate the disk cache to that version then down to the one requested. This
> would require me to have an "undo" function to complement the "apply"
> function, which is an acceptable compromise.
> 
> 2) the second solution would be to represent the key-value store as a
> binary tree. Since each operation changes only a few keys, I wouldn't be
> making a dumb copy of the entire key-value store at each step, only of the
> keys that matter. The problem is that though that works great in memory,
> I'm not sure how to make it work with an on-disk storage, a necessity given
> the size of it.
> 
> Thanks,
> Arthurto

Wouldn't a simple Map work as your key-value store? The apply function
would then modify the map, creating a new map that shares most of the
unchanged data with the original map. That way keeping the results of
many apply functions in memory should be possible and applying should
be fast.

But if memory becomes a problem then you probably want to use some
form of B-Tree to.

MfG
	Goswin

  parent reply	other threads:[~2014-04-14  8:12 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-04-14  5:25 Arthur Breitman
2014-04-14  5:43 ` Francois Berenger
2014-04-14  6:07   ` Arthur Breitman
2014-04-14  8:12 ` Goswin von Brederlow [this message]
2014-04-14 19:32   ` Arthur Breitman
2014-04-15  9:05     ` Goswin von Brederlow
2014-04-15 13:16       ` Arthur Breitman
2014-04-15 13:50     ` Thomas Gazagnaire
2014-04-16 11:02       ` Arthur Breitman
2014-04-14 20:38 ` Richard W.M. Jones

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20140414081214.GD25532@frosties \
    --to=goswin-v-b@web.de \
    --cc=caml-list@inria.fr \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).