From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Original-To: caml-list@sympa.inria.fr Delivered-To: caml-list@sympa.inria.fr Received: from mail2-relais-roc.national.inria.fr (mail2-relais-roc.national.inria.fr [192.134.164.83]) by sympa.inria.fr (Postfix) with ESMTPS id A571A7F720 for ; Mon, 14 Apr 2014 10:12:17 +0200 (CEST) Received-SPF: None (mail2-smtp-roc.national.inria.fr: no sender authenticity information available from domain of goswin-v-b@web.de) identity=pra; client-ip=212.227.17.11; receiver=mail2-smtp-roc.national.inria.fr; envelope-from="goswin-v-b@web.de"; x-sender="goswin-v-b@web.de"; x-conformance=sidf_compatible Received-SPF: None (mail2-smtp-roc.national.inria.fr: no sender authenticity information available from domain of goswin-v-b@web.de) identity=mailfrom; client-ip=212.227.17.11; receiver=mail2-smtp-roc.national.inria.fr; envelope-from="goswin-v-b@web.de"; x-sender="goswin-v-b@web.de"; x-conformance=sidf_compatible Received-SPF: None (mail2-smtp-roc.national.inria.fr: no sender authenticity information available from domain of postmaster@mout.web.de) identity=helo; client-ip=212.227.17.11; receiver=mail2-smtp-roc.national.inria.fr; envelope-from="goswin-v-b@web.de"; x-sender="postmaster@mout.web.de"; x-conformance=sidf_compatible X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AsQBADGXS1PU4xELnGdsb2JhbABawVSFXYEgFg4BAQEBAQYNCQkUKIIlAQEBAwE6NBALCxgJJQ8FKE+HTAEMDMM5H4cbF4lGhD0iUIMkgRQElHSDbIY+Eo8ngWg X-IPAS-Result: AsQBADGXS1PU4xELnGdsb2JhbABawVSFXYEgFg4BAQEBAQYNCQkUKIIlAQEBAwE6NBALCxgJJQ8FKE+HTAEMDMM5H4cbF4lGhD0iUIMkgRQElHSDbIY+Eo8ngWg X-IronPort-AV: E=Sophos;i="4.97,855,1389740400"; d="scan'208";a="68090674" Received: from mout.web.de ([212.227.17.11]) by mail2-smtp-roc.national.inria.fr with ESMTP/TLS/DHE-RSA-AES256-SHA; 14 Apr 2014 10:12:17 +0200 Received: from frosties.localnet ([78.43.112.216]) by smtp.web.de (mrweb103) with ESMTPSA (Nemesis) id 0LzKEH-1X45JJ0gkb-014QEx for ; Mon, 14 Apr 2014 10:12:16 +0200 Received: from mrvn by frosties.localnet with local (Exim 4.80) (envelope-from ) id 1WZbzy-0006rO-Nu for caml-list@inria.fr; Mon, 14 Apr 2014 10:12:14 +0200 Date: Mon, 14 Apr 2014 10:12:14 +0200 From: Goswin von Brederlow To: caml-list@inria.fr Message-ID: <20140414081214.GD25532@frosties> References: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) X-Provags-ID: V03:K0:MEvH7fPF0mUyowbHYR40AsAu6Y4aOhvmsrYOVgibIEIXJcxGiGJ D2dZn5nAc2Q84lBAYeene0sinGxRs5frEOus7nw8zJ16pxgKKWgmfNkAIVf/u7qe5rLvR1i 5FRnZiGNHJkYbVuk/QDXGrfVEzpIecghSljP13vqFBzaN4g+mruzcaqkB1FtixvNoYSybMB irzG47e6Up9jYv4FVuQeA== Subject: Re: [Caml-list] Keeping A big data optimization problem functional On Sun, Apr 13, 2014 at 11:25:18PM -0600, Arthur Breitman wrote: > Hello all, > > This is my first post to the mailing list, I hope it will be relevant and > in line with the posting guidelines. If not, please let me know and pardon > my ignorance. > > I am working on a large problem that I'm attempting to tackle in Ocaml. It > requires me to optimize a function over a tree > > - the tree has on the order of 100,000 to 1,000,000 nodes > - each node on the tree contains on the order of 100kB to 1MB of data > > So total, I'm looking at 10GB to 1TB one data which is typically stored in > a database on disk. The tree is loaded in memory using keys that represent > entries in the database. > > Another structure I'm dealing with is a key value store, whose size is on > the same order of magnitude as the size of the data above. Initially, this > key-value store is empty. You probably don't expect to keep 1TB data in ram. Does the key-value store just reference values in the DB? > I am given an "apply" function that, for a given key-value store and for a > given node in the tree produces a new key-value store. Typically by > modifying a few keys in the key-value store. The amount of bits changed is > on the same order of magnitude as the size of the node. When I reach a leaf > of the tree, I can compute the optimality of that leaf from the state of > the key-value store. I want to find the best leaf. Not sure what you mean there. Maybe you could give an example with a small data set (maybe 10 keys). > Here's my problem... > If I write the following in a imperative fashion, it seems straightforward > enough. I should be doing a DFS over my tree, mutating my key=value store > as I go, and saving on disk "backup" information to backtrack through the > tree. This doesn't feel very elegant. I've been considering two solutions > > 1) since there is a one to one mapping between nodes in the tree and > key-value stores, I could make my key-value store type to actually be the > same as a node in the tree. I would put it in a module along with a > reference to the node representing which key-value store is being > represented on disk. The "apply" function would be straightforward and > lazy, but whenever the keys of a key-value store are to be accessed, the > following would happen > > a) Check if the key-value store on disk is the same as the one requested, > if so read the key > b) If that's not the case, find the least common ancestor in the tree, > migrate the disk cache to that version then down to the one requested. This > would require me to have an "undo" function to complement the "apply" > function, which is an acceptable compromise. > > 2) the second solution would be to represent the key-value store as a > binary tree. Since each operation changes only a few keys, I wouldn't be > making a dumb copy of the entire key-value store at each step, only of the > keys that matter. The problem is that though that works great in memory, > I'm not sure how to make it work with an on-disk storage, a necessity given > the size of it. > > Thanks, > Arthurto Wouldn't a simple Map work as your key-value store? The apply function would then modify the map, creating a new map that shares most of the unchanged data with the original map. That way keeping the results of many apply functions in memory should be possible and applying should be fast. But if memory becomes a problem then you probably want to use some form of B-Tree to. MfG Goswin