From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Original-To: caml-list@sympa.inria.fr Delivered-To: caml-list@sympa.inria.fr Received: from mail2-relais-roc.national.inria.fr (mail2-relais-roc.national.inria.fr [192.134.164.83]) by sympa.inria.fr (Postfix) with ESMTPS id 792577F720 for ; Mon, 14 Apr 2014 07:43:52 +0200 (CEST) Received-SPF: None (mail2-smtp-roc.national.inria.fr: no sender authenticity information available from domain of berenger@riken.jp) identity=pra; client-ip=134.160.33.161; receiver=mail2-smtp-roc.national.inria.fr; envelope-from="berenger@riken.jp"; x-sender="berenger@riken.jp"; x-conformance=sidf_compatible Received-SPF: Pass (mail2-smtp-roc.national.inria.fr: domain of berenger@riken.jp designates 134.160.33.161 as permitted sender) identity=mailfrom; client-ip=134.160.33.161; receiver=mail2-smtp-roc.national.inria.fr; envelope-from="berenger@riken.jp"; x-sender="berenger@riken.jp"; x-conformance=sidf_compatible; x-record-type="v=spf1" Received-SPF: Pass (mail2-smtp-roc.national.inria.fr: domain of postmaster@postman.riken.jp designates 134.160.33.161 as permitted sender) identity=helo; client-ip=134.160.33.161; receiver=mail2-smtp-roc.national.inria.fr; envelope-from="berenger@riken.jp"; x-sender="postmaster@postman.riken.jp"; x-conformance=sidf_compatible; x-record-type="v=spf1" X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AucBACt1S1OGoCGhnGdsb2JhbABahyjAA4E3DgEBAQEBBg0JCRQogiUBAQEEIw8BBTYKEQsYAgIFFgsCAgkDAgECAUUTCAEBF4dhqGmiNBeBKYgdhD0iUIJvgUkBA4laixqDbYZUjy+BWw X-IPAS-Result: AucBACt1S1OGoCGhnGdsb2JhbABahyjAA4E3DgEBAQEBBg0JCRQogiUBAQEEIw8BBTYKEQsYAgIFFgsCAgkDAgECAUUTCAEBF4dhqGmiNBeBKYgdhD0iUIJvgUkBA4laixqDbYZUjy+BWw X-IronPort-AV: E=Sophos;i="4.97,854,1389740400"; d="scan'208";a="68058806" Received: from postman1.riken.jp (HELO postman.riken.jp) ([134.160.33.161]) by mail2-smtp-roc.national.inria.fr with ESMTP; 14 Apr 2014 07:43:50 +0200 Received: from postman.riken.jp (postman1.riken.jp [127.0.0.1]) by postman.riken.jp (Postfix) with SMTP id D5E362588002 for ; Mon, 14 Apr 2014 14:43:46 +0900 (JST) Received: from [10.64.65.124] (ipm04.gsc.riken.go.jp [134.160.83.74]) by postman.riken.jp (Postfix) with ESMTPA id A505C32A0092 for ; Mon, 14 Apr 2014 14:43:46 +0900 (JST) Message-ID: <534B7587.6090301@riken.jp> Date: Mon, 14 Apr 2014 14:43:35 +0900 From: Francois Berenger User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.2.0 MIME-Version: 1.0 To: caml-list@inria.fr References: In-Reply-To: Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit X-PMX-Version: 6.0.0.2142326, Antispam-Engine: 2.7.2.2107409, Antispam-Data: 2014.4.14.53622 Subject: Re: [Caml-list] Keeping A big data optimization problem functional And that's when I realize that "Oh! No! We don't have bindings to Berkeley DB in OPAM!". That's unfortunate. :( On 2014年04月14日 14:25, Arthur Breitman wrote: > Hello all, > > This is my first post to the mailing list, I hope it will be relevant and > in line with the posting guidelines. If not, please let me know and pardon > my ignorance. > > I am working on a large problem that I'm attempting to tackle in Ocaml. It > requires me to optimize a function over a tree > > - the tree has on the order of 100,000 to 1,000,000 nodes > - each node on the tree contains on the order of 100kB to 1MB of data > > So total, I'm looking at 10GB to 1TB one data which is typically stored in > a database on disk. The tree is loaded in memory using keys that represent > entries in the database. > > Another structure I'm dealing with is a key value store, whose size is on > the same order of magnitude as the size of the data above. Initially, this > key-value store is empty. > > I am given an "apply" function that, for a given key-value store and for a > given node in the tree produces a new key-value store. Typically by > modifying a few keys in the key-value store. The amount of bits changed is > on the same order of magnitude as the size of the node. When I reach a leaf > of the tree, I can compute the optimality of that leaf from the state of > the key-value store. I want to find the best leaf. > > Here's my problem... > If I write the following in a imperative fashion, it seems straightforward > enough. I should be doing a DFS over my tree, mutating my key=value store > as I go, and saving on disk "backup" information to backtrack through the > tree. This doesn't feel very elegant. I've been considering two solutions > > 1) since there is a one to one mapping between nodes in the tree and > key-value stores, I could make my key-value store type to actually be the > same as a node in the tree. I would put it in a module along with a > reference to the node representing which key-value store is being > represented on disk. The "apply" function would be straightforward and > lazy, but whenever the keys of a key-value store are to be accessed, the > following would happen > > a) Check if the key-value store on disk is the same as the one requested, > if so read the key > b) If that's not the case, find the least common ancestor in the tree, > migrate the disk cache to that version then down to the one requested. This > would require me to have an "undo" function to complement the "apply" > function, which is an acceptable compromise. > > 2) the second solution would be to represent the key-value store as a > binary tree. Since each operation changes only a few keys, I wouldn't be > making a dumb copy of the entire key-value store at each step, only of the > keys that matter. The problem is that though that works great in memory, > I'm not sure how to make it work with an on-disk storage, a necessity given > the size of it. > > Thanks, > Arthurto > -- Best regards, Francois Berenger.