Hello all,

This is my first post to the mailing list, I hope it will be relevant and in line with the posting guidelines. If not, please let me know and pardon my ignorance.

I am working on a large problem that I'm attempting to tackle in Ocaml. It requires me to optimize a function over a tree

- the tree has on the order of 100,000 to 1,000,000 nodes
- each node on the tree contains on the order of 100kB to 1MB of data

So total, I'm looking at 10GB to 1TB one data which is typically stored in a database on disk. The tree is loaded in memory using keys that represent entries in the database.

Another structure I'm dealing with is a key value store, whose size is on the same order of magnitude as the size of the data above. Initially, this key-value store is empty.

I am given an "apply" function that, for a given key-value store and for a given node in the tree produces a new key-value store. Typically by modifying a few keys in the key-value store. The amount of bits changed is on the same order of magnitude as the size of the node. When I reach a leaf of the tree, I can compute the optimality of that leaf from the state of the key-value store. I want to find the best leaf.

Here's my problem...
If I write the following in a imperative fashion, it seems straightforward enough. I should be doing a DFS over my tree, mutating my key=value store as I go, and saving on disk "backup" information to backtrack through the tree. This doesn't feel very elegant. I've been considering two solutions

1) since there is a one to one mapping between nodes in the tree and key-value stores, I could make my key-value store type to actually be the same as a node in the tree. I would put it in a module along with a reference to the node representing which key-value store is being represented on disk. The "apply" function would be straightforward and lazy, but whenever the keys of a key-value store are to be accessed, the following would happen

a) Check if the key-value store on disk is the same as the one requested, if so read the key
b) If that's not the case, find the least common ancestor in the tree, migrate the disk cache to that version then down to the one requested. This would require me to have an "undo" function to complement the "apply" function, which is an acceptable compromise.

2) the second solution would be to represent the key-value store as a binary tree. Since each operation changes only a few keys, I wouldn't be making a dumb copy of the entire key-value store at each step, only of the keys that matter. The problem is that though that works great in memory, I'm not sure how to make it work with an on-disk storage, a necessity given the size of it.

Thanks,
Arthurto