From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Original-To: caml-list@sympa.inria.fr Delivered-To: caml-list@sympa.inria.fr Received: from mail3-relais-sop.national.inria.fr (mail3-relais-sop.national.inria.fr [192.134.164.104]) by sympa.inria.fr (Postfix) with ESMTPS id A9AC77F720 for ; Tue, 15 Apr 2014 15:16:23 +0200 (CEST) Received-SPF: None (mail3-smtp-sop.national.inria.fr: no sender authenticity information available from domain of arthur.breitman@gmail.com) identity=pra; client-ip=209.85.215.53; receiver=mail3-smtp-sop.national.inria.fr; envelope-from="arthur.breitman@gmail.com"; x-sender="arthur.breitman@gmail.com"; x-conformance=sidf_compatible Received-SPF: Pass (mail3-smtp-sop.national.inria.fr: domain of arthur.breitman@gmail.com designates 209.85.215.53 as permitted sender) identity=mailfrom; client-ip=209.85.215.53; receiver=mail3-smtp-sop.national.inria.fr; envelope-from="arthur.breitman@gmail.com"; x-sender="arthur.breitman@gmail.com"; x-conformance=sidf_compatible; x-record-type="v=spf1" Received-SPF: None (mail3-smtp-sop.national.inria.fr: no sender authenticity information available from domain of postmaster@mail-la0-f53.google.com) identity=helo; client-ip=209.85.215.53; receiver=mail3-smtp-sop.national.inria.fr; envelope-from="arthur.breitman@gmail.com"; x-sender="postmaster@mail-la0-f53.google.com"; x-conformance=sidf_compatible X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AqUCAC4wTVPRVdc1Y2dsb2JhbABZgwY7V4MQqRCOFIh0gRoIFgMWDQkJFSeCJQEBAQMBIwQZARsSCwEDDAYDAgsNDR0CAiIBEQEFAQoSBhMSh1UBAwkIDY0ZkAuMDlGDDpdZChknAwoVT4V/EQEFDI5TBAeCb4FJBIlcixmDbYE2jyAYKYR+Hg X-IPAS-Result: AqUCAC4wTVPRVdc1Y2dsb2JhbABZgwY7V4MQqRCOFIh0gRoIFgMWDQkJFSeCJQEBAQMBIwQZARsSCwEDDAYDAgsNDR0CAiIBEQEFAQoSBhMSh1UBAwkIDY0ZkAuMDlGDDpdZChknAwoVT4V/EQEFDI5TBAeCb4FJBIlcixmDbYE2jyAYKYR+Hg X-IronPort-AV: E=Sophos;i="4.97,864,1389740400"; d="scan'208";a="57224041" Received: from mail-la0-f53.google.com ([209.85.215.53]) by mail3-smtp-sop.national.inria.fr with ESMTP; 15 Apr 2014 15:16:22 +0200 Received: by mail-la0-f53.google.com with SMTP id b8so6772046lan.26 for ; Tue, 15 Apr 2014 06:16:22 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type; bh=/y7PsX8OVJGL4pP3AxmsUZ7IcEn/SvuRnb+q02aJSXM=; b=w98tFPIwWB6dOLbI04tj+fJrlHLzs65VUQfe2YFXy7DSEsJjFsHsc6cl11sx1rCigu D/Xcu6FUXyPL70fagq72kpBFrQPmaF5rFk4M2AAcSkqtr78fFwuSpnGO45qJi5q/upAX td/ph8SMdIiAYtZ/peTOMF3QmQwa3clL4wrhCnUbANxrZcFTxUDva1AP+1xFAcx+1To0 wcVyWMG1/BNJFSSa1s00hGxIoqRK5pjAs7w8IqYRmKjWE/Hu7Vh5Jg7C2VEDgbXnnUGc 0WRqWzwGRq+PE4L2tHO7Ge8N5j2G2cVHxqFF0Ym1rDQf3n05fh7ziVrzsDRXHObTRRMb xdoQ== MIME-Version: 1.0 X-Received: by 10.152.8.161 with SMTP id s1mr275547laa.67.1397567782655; Tue, 15 Apr 2014 06:16:22 -0700 (PDT) Received: by 10.114.158.130 with HTTP; Tue, 15 Apr 2014 06:16:22 -0700 (PDT) Received: by 10.114.158.130 with HTTP; Tue, 15 Apr 2014 06:16:22 -0700 (PDT) In-Reply-To: <20140415090542.GB10918@frosties> References: <20140414081214.GD25532@frosties> <20140415090542.GB10918@frosties> Date: Tue, 15 Apr 2014 07:16:22 -0600 Message-ID: From: Arthur Breitman To: Goswin von Brederlow Cc: caml-list Content-Type: multipart/alternative; boundary=089e0158b8f63c1afa04f7149b39 Subject: Re: [Caml-list] Keeping A big data optimization problem functional --089e0158b8f63c1afa04f7149b39 Content-Type: text/plain; charset=UTF-8 On Apr 15, 2014 5:06 AM, "Goswin von Brederlow" wrote: > > On Mon, Apr 14, 2014 at 01:32:18PM -0600, Arthur Breitman wrote: > > Responses in-line > > > > On Mon, Apr 14, 2014 at 2:12 AM, Goswin von Brederlow wrote: > > > > > On Sun, Apr 13, 2014 at 11:25:18PM -0600, Arthur Breitman wrote: > > > > Hello all, > > > > > > Another structure I'm dealing with is a key value store, whose size is on > > > > the same order of magnitude as the size of the data above. Initially, > > > this > > > > key-value store is empty. > > > > > > You probably don't expect to keep 1TB data in ram. Does the key-value > > > store just reference values in the DB? > > > > > > Indeed it does. That's part of the difficulty. > > > > > > > I am given an "apply" function that, for a given key-value store and for > > > a > > > > given node in the tree produces a new key-value store. Typically by > > > > modifying a few keys in the key-value store. The amount of bits changed > > > is > > > > on the same order of magnitude as the size of the node. When I reach a > > > leaf > > > > of the tree, I can compute the optimality of that leaf from the state of > > > > the key-value store. I want to find the best leaf. > > > > > > Not sure what you mean there. Maybe you could give an example with a > > > small data set (maybe 10 keys). > > > > > > > It would be hard to make it a small example, but abstractly I have: > > > > type node = {value: string; children: node list option} > > A list can be [], no need for an option usualy. Or is there a > difference between None and Some []? 1) yes the "option" part is useless, I missed it in the wrote > > module type Context = begin > > type t > > val apply: t -> node -> t > > val get: t -> string -> string option > > val set: t -> string -> string -> t > > val fitness -> int > > val empty -> t > > end > > Shouldn't that be: > > module type Context = begin > type t > type db_ref > val apply: t -> node -> t > val get: t -> string -> db_ref option > val set: t -> string -> db_ref -> t > val fitness -> int > val empty -> t > end > 2) I'm not sure what you mean by a "db_ref", but I don't think the module signature you describe is what I want. I stand by the signature I wrote. What I want is a functional map that just happens to be so large it needs a disk database to be stored. > > Given a tree of nodes, and starting with Context.empty find the leaf node > > with the highest fitness. You can think of every node in the tree as an > > operation acting on the context. > > > > > > > > > > Wouldn't a simple Map work as your key-value store? The apply function > > > would then modify the map, creating a new map that shares most of the > > > unchanged data with the original map. That way keeping the results of > > > many apply functions in memory should be possible and applying should > > > be fast. > > > > > > > It would if I didn't potentially have a terabyte of data. > > But you wouldn't put the data into the map, only the reference to the > DB. That would work if I had a small number of keys with large value types. Instead I have on the order of a billion keyvalue pairs. > > > > But if memory becomes a problem then you probably want to use some > > > form of B-Tree to. > > > > > > MfG > > > Goswin > > -- > > Arthur Breitman > > MfG > Goswin > > -- > Caml-list mailing list. Subscription management and archives: > https://sympa.inria.fr/sympa/arc/caml-list > Beginner's list: http://groups.yahoo.com/group/ocaml_beginners > Bug reports: http://caml.inria.fr/bin/caml-bugs --089e0158b8f63c1afa04f7149b39 Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable


On Apr 15, 2014 5:06 AM, "Goswin von Brederlow" <goswin-v-b@web.de> wrote:
>
> On Mon, Apr 14, 2014 at 01:32:18PM -0600, Arthur Breitman wrote:
> > Responses in-line
> >
> > On Mon, Apr 14, 2014 at 2:12 AM, Goswin von Brederlow <goswin-v-b@web.de>wrote:
> >
> > > On Sun, Apr 13, 2014 at 11:25:18PM -0600, Arthur Breitman wr= ote:
> > > > Hello all,
> > >
> > > Another structure I'm dealing with is a key value store,= whose size is on
> > > > the same order of magnitude as the size of the data abo= ve. Initially,
> > > this
> > > > key-value store is empty.
> > >
> > > You probably don't expect to keep 1TB data in ram. Does = the key-value
> > > store just reference values in the DB?
> >
> >
> > Indeed it does. That's part of the difficulty.
> >
> >
> > =C2=A0> I am given an "apply" function that, for a g= iven key-value store and for
> > > a
> > > > given node in the tree produces a new key-value store. = Typically by
> > > > modifying a few keys in the key-value store. The amount= of bits changed
> > > is
> > > > on the same order of magnitude as the size of the node.= When I reach a
> > > leaf
> > > > of the tree, I can compute the optimality of that leaf = from the state of
> > > > the key-value store. I want to find the best leaf.
> > >
> > > Not sure what you mean there. Maybe you could give an exampl= e with a
> > > small data set (maybe 10 keys).
> > >
> >
> > It would be hard to make it a small example, but abstractly I hav= e:
> >
> > type node =3D {value: string; children: node list option}
>
> A list can be [], no need for an option usualy. Or is there a
> difference between None and Some []?

1) yes the "option" part is useless, I missed it i= n the wrote

> > module type Context =3D begin
> > =C2=A0 type t
> > =C2=A0 val apply: t -> node -> t
> > =C2=A0 val get: t -> string -> string option
> > =C2=A0 val set: t -> string -> string -> t
> > =C2=A0 val fitness -> int
> > =C2=A0 val empty -> t
> > end
>
> Shouldn't that be:
>
> module type Context =3D begin
> =C2=A0 type t
> =C2=A0 type db_ref
> =C2=A0 val apply: t -> node -> t
> =C2=A0 val get: t -> string -> db_ref option
> =C2=A0 val set: t -> string -> db_ref -> t
> =C2=A0 val fitness -> int
> =C2=A0 val empty -> t
> end
>

2) I'm not sure what you mean by a "db_ref", b= ut I don't think the module signature you describe is what I want. I st= and by the signature I wrote. What I want is a functional map that just hap= pens to be so large it needs a disk database to be stored.


> > Given a tree of nodes, and starting with Context.e= mpty find the leaf node
> > with the highest fitness. You can think of every node in the tree= as an
> > operation acting on the context.
> >
> >
> > >
> > > Wouldn't a simple Map work as your key-value store? The = apply function
> > > would then modify the map, creating a new map that shares mo= st of the
> > > unchanged data with the original map. That way keeping the r= esults of
> > > many apply functions in memory should be possible and applyi= ng should
> > > be fast.
> > >
> >
> > It would if I didn't potentially have a terabyte of data.
>
> But you wouldn't put the data into the map, only the reference to = the
> DB.

That would work if I had a small number of keys with large v= alue types. Instead I have on the order of a billion keyvalue pairs.

>
> > > But if memory becomes a problem then you probably want to us= e some
> > > form of B-Tree to.
> > >
> > > MfG
> > > =C2=A0 =C2=A0 =C2=A0 =C2=A0 Goswin
> > --
> > Arthur Breitman
>
> MfG
> =C2=A0 =C2=A0 =C2=A0 =C2=A0 Goswin
>
> --
> Caml-list mailing list. =C2=A0Subscription management and archives:
> https://sympa.i= nria.fr/sympa/arc/caml-list
> Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
> Bug reports: http://cam= l.inria.fr/bin/caml-bugs

--089e0158b8f63c1afa04f7149b39--