From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Original-To: caml-list@sympa.inria.fr Delivered-To: caml-list@sympa.inria.fr Received: from mail3-relais-sop.national.inria.fr (mail3-relais-sop.national.inria.fr [192.134.164.104]) by sympa.inria.fr (Postfix) with ESMTPS id 29AA37F720 for ; Mon, 14 Apr 2014 07:25:40 +0200 (CEST) Received-SPF: None (mail3-smtp-sop.national.inria.fr: no sender authenticity information available from domain of arthur.breitman@gmail.com) identity=pra; client-ip=209.85.215.54; receiver=mail3-smtp-sop.national.inria.fr; envelope-from="arthur.breitman@gmail.com"; x-sender="arthur.breitman@gmail.com"; x-conformance=sidf_compatible Received-SPF: Pass (mail3-smtp-sop.national.inria.fr: domain of arthur.breitman@gmail.com designates 209.85.215.54 as permitted sender) identity=mailfrom; client-ip=209.85.215.54; receiver=mail3-smtp-sop.national.inria.fr; envelope-from="arthur.breitman@gmail.com"; x-sender="arthur.breitman@gmail.com"; x-conformance=sidf_compatible; x-record-type="v=spf1" Received-SPF: None (mail3-smtp-sop.national.inria.fr: no sender authenticity information available from domain of postmaster@mail-la0-f54.google.com) identity=helo; client-ip=209.85.215.54; receiver=mail3-smtp-sop.national.inria.fr; envelope-from="arthur.breitman@gmail.com"; x-sender="postmaster@mail-la0-f54.google.com"; x-conformance=sidf_compatible X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AoMFAAFxS1PRVdc2Ymdsb2JhbABagwaBEoMQv30EgRsIFgMWCwsJFSeCTx0BGxIMAxIQNwIkAREBBQFQh0wBAxGZSoMPjA5Rgw6XHAoZJw1khX8RAQUMiTqEPSKDP4FJBIlcixiDbZBUGCmEfh6BLQ X-IPAS-Result: AoMFAAFxS1PRVdc2Ymdsb2JhbABagwaBEoMQv30EgRsIFgMWCwsJFSeCTx0BGxIMAxIQNwIkAREBBQFQh0wBAxGZSoMPjA5Rgw6XHAoZJw1khX8RAQUMiTqEPSKDP4FJBIlcixiDbZBUGCmEfh6BLQ X-IronPort-AV: E=Sophos;i="4.97,854,1389740400"; d="scan'208";a="56983023" Received: from mail-la0-f54.google.com ([209.85.215.54]) by mail3-smtp-sop.national.inria.fr with ESMTP; 14 Apr 2014 07:25:39 +0200 Received: by mail-la0-f54.google.com with SMTP id mc6so5324100lab.13 for ; Sun, 13 Apr 2014 22:25:38 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:from:date:message-id:subject:to:content-type; bh=Q0j+wOJlqHmM22tOWAR52OehPN7AlswBAgZMj2Ym3jc=; b=G3tppwQR/d2PfjWZxJj2KzFoFWoP6x6e2gXsY/WhRW4UeTJ9oP993AvYZZPlaeoMnS Fj7Vag9QWgJYjqbHvuU14IOiqSxWyJVNzZsGq08q4kwspuAO4YwZwQctOSrahjNbIp0B wHjUQOhGLIAcHGJ+hKBDydmpdeQp9s0/b+TaVl2mq3QIu+0U0/Bn3M/nMxN67QdrTuPP ofOoOPyERYNYYHU6osIjRWLCLOXD1fw0D7YXRMKSJ0pKp7md3+H/hXM43l+acKbq/qO4 cjhKI1gBQekjuVKAyofH2NCCYS8XBeivU53+doW5M+dqHIextTS6ygNYbAHVojC9TswM YQBQ== X-Received: by 10.112.47.3 with SMTP id z3mr555567lbm.34.1397453138502; Sun, 13 Apr 2014 22:25:38 -0700 (PDT) MIME-Version: 1.0 Received: by 10.114.158.130 with HTTP; Sun, 13 Apr 2014 22:25:18 -0700 (PDT) From: Arthur Breitman Date: Sun, 13 Apr 2014 23:25:18 -0600 Message-ID: To: caml-list@inria.fr Content-Type: multipart/alternative; boundary=001a1133ab62e92a9904f6f9e9ab Subject: [Caml-list] Keeping A big data optimization problem functional --001a1133ab62e92a9904f6f9e9ab Content-Type: text/plain; charset=UTF-8 Hello all, This is my first post to the mailing list, I hope it will be relevant and in line with the posting guidelines. If not, please let me know and pardon my ignorance. I am working on a large problem that I'm attempting to tackle in Ocaml. It requires me to optimize a function over a tree - the tree has on the order of 100,000 to 1,000,000 nodes - each node on the tree contains on the order of 100kB to 1MB of data So total, I'm looking at 10GB to 1TB one data which is typically stored in a database on disk. The tree is loaded in memory using keys that represent entries in the database. Another structure I'm dealing with is a key value store, whose size is on the same order of magnitude as the size of the data above. Initially, this key-value store is empty. I am given an "apply" function that, for a given key-value store and for a given node in the tree produces a new key-value store. Typically by modifying a few keys in the key-value store. The amount of bits changed is on the same order of magnitude as the size of the node. When I reach a leaf of the tree, I can compute the optimality of that leaf from the state of the key-value store. I want to find the best leaf. Here's my problem... If I write the following in a imperative fashion, it seems straightforward enough. I should be doing a DFS over my tree, mutating my key=value store as I go, and saving on disk "backup" information to backtrack through the tree. This doesn't feel very elegant. I've been considering two solutions 1) since there is a one to one mapping between nodes in the tree and key-value stores, I could make my key-value store type to actually be the same as a node in the tree. I would put it in a module along with a reference to the node representing which key-value store is being represented on disk. The "apply" function would be straightforward and lazy, but whenever the keys of a key-value store are to be accessed, the following would happen a) Check if the key-value store on disk is the same as the one requested, if so read the key b) If that's not the case, find the least common ancestor in the tree, migrate the disk cache to that version then down to the one requested. This would require me to have an "undo" function to complement the "apply" function, which is an acceptable compromise. 2) the second solution would be to represent the key-value store as a binary tree. Since each operation changes only a few keys, I wouldn't be making a dumb copy of the entire key-value store at each step, only of the keys that matter. The problem is that though that works great in memory, I'm not sure how to make it work with an on-disk storage, a necessity given the size of it. Thanks, Arthurto --001a1133ab62e92a9904f6f9e9ab Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable
Hello all,

This is my first post to the= mailing list, I hope it will be relevant and in line with the posting guid= elines. If not, please let me know and pardon my ignorance.

I am working on a large problem that I'm attempting to tackl= e in Ocaml. It requires me to optimize a function over a tree
- the tree has on the order of 100,000 to 1,000,000 nodes
- each node on the tree contains on the order of 100kB to 1MB of data<= /div>

So total, I'm looking at 10GB to 1TB one data = which is typically stored in a database on disk. The tree is loaded in memo= ry using keys that represent entries in the database.

Another structure I'm dealing with is a key value s= tore, whose size is on the same order of magnitude as the size of the data = above. Initially, this key-value store is empty.

I am given an "apply" function that, for a given key-value store = and for a given node in the tree produces a new key-value store. Typically = by modifying a few keys in the key-value store. The amount of bits changed = is on the same order of magnitude as the size of the node. When I reach a l= eaf of the tree, I can compute the optimality of that leaf from the state o= f the key-value store. I want to find the best leaf.

Here's my problem...
If I write the follo= wing in a imperative fashion, it seems straightforward enough. I should be = doing a DFS over my tree, mutating my key=3Dvalue store as I go, and saving= on disk "backup" information to backtrack through the tree. This= doesn't feel very elegant. I've been considering two solutions

1) since there is a one to one mapping between nodes in= the tree and key-value stores, I could make my key-value store type to act= ually be the same as a node in the tree. I would put it in a module along w= ith a reference to the node representing which key-value store is being rep= resented on disk. The "apply" function would be straightforward a= nd lazy, but whenever the keys of a key-value store are to be accessed, the= following would happen

a) Check if the key-value store on disk is the same as = the one requested, if so read the key
b) If that's not the ca= se, find the least common ancestor in the tree, migrate the disk cache to t= hat version then down to the one requested. This would require me to have a= n "undo" function to complement the "apply" function, w= hich is an acceptable compromise.

2) the second solution would be to represent the key-va= lue store as a binary tree. Since each operation changes only a few keys, I= wouldn't be making a dumb copy of the entire key-value store at each s= tep, only of the keys that matter. The problem is that though that works gr= eat in memory, I'm not sure how to make it work with an on-disk storage= , a necessity given the size of it.

Thanks,
Arthurto
--001a1133ab62e92a9904f6f9e9ab--