caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed
* [Caml-list] Keeping A big data optimization problem functional
@ 2014-04-14  5:25 Arthur Breitman
  2014-04-14  5:43 ` Francois Berenger
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Arthur Breitman @ 2014-04-14  5:25 UTC (permalink / raw)
  To: caml-list

[-- Attachment #1: Type: text/plain, Size: 2720 bytes --]

Hello all,

This is my first post to the mailing list, I hope it will be relevant and
in line with the posting guidelines. If not, please let me know and pardon
my ignorance.

I am working on a large problem that I'm attempting to tackle in Ocaml. It
requires me to optimize a function over a tree

- the tree has on the order of 100,000 to 1,000,000 nodes
- each node on the tree contains on the order of 100kB to 1MB of data

So total, I'm looking at 10GB to 1TB one data which is typically stored in
a database on disk. The tree is loaded in memory using keys that represent
entries in the database.

Another structure I'm dealing with is a key value store, whose size is on
the same order of magnitude as the size of the data above. Initially, this
key-value store is empty.

I am given an "apply" function that, for a given key-value store and for a
given node in the tree produces a new key-value store. Typically by
modifying a few keys in the key-value store. The amount of bits changed is
on the same order of magnitude as the size of the node. When I reach a leaf
of the tree, I can compute the optimality of that leaf from the state of
the key-value store. I want to find the best leaf.

Here's my problem...
If I write the following in a imperative fashion, it seems straightforward
enough. I should be doing a DFS over my tree, mutating my key=value store
as I go, and saving on disk "backup" information to backtrack through the
tree. This doesn't feel very elegant. I've been considering two solutions

1) since there is a one to one mapping between nodes in the tree and
key-value stores, I could make my key-value store type to actually be the
same as a node in the tree. I would put it in a module along with a
reference to the node representing which key-value store is being
represented on disk. The "apply" function would be straightforward and
lazy, but whenever the keys of a key-value store are to be accessed, the
following would happen

a) Check if the key-value store on disk is the same as the one requested,
if so read the key
b) If that's not the case, find the least common ancestor in the tree,
migrate the disk cache to that version then down to the one requested. This
would require me to have an "undo" function to complement the "apply"
function, which is an acceptable compromise.

2) the second solution would be to represent the key-value store as a
binary tree. Since each operation changes only a few keys, I wouldn't be
making a dumb copy of the entire key-value store at each step, only of the
keys that matter. The problem is that though that works great in memory,
I'm not sure how to make it work with an on-disk storage, a necessity given
the size of it.

Thanks,
Arthurto

[-- Attachment #2: Type: text/html, Size: 3168 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2014-04-16 11:02 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-04-14  5:25 [Caml-list] Keeping A big data optimization problem functional Arthur Breitman
2014-04-14  5:43 ` Francois Berenger
2014-04-14  6:07   ` Arthur Breitman
2014-04-14  8:12 ` Goswin von Brederlow
2014-04-14 19:32   ` Arthur Breitman
2014-04-15  9:05     ` Goswin von Brederlow
2014-04-15 13:16       ` Arthur Breitman
2014-04-15 13:50     ` Thomas Gazagnaire
2014-04-16 11:02       ` Arthur Breitman
2014-04-14 20:38 ` Richard W.M. Jones

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).