caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed
* [Caml-list] Q: automatic forgetting cache, module Weak, Gc control
@ 2004-06-30 11:05 Jan Kybic
  2004-07-01  8:43 ` Jacques GARRIGUE
  0 siblings, 1 reply; 7+ messages in thread
From: Jan Kybic @ 2004-06-30 11:05 UTC (permalink / raw)
  To: caml-list

Hello,

        I am implementing a complicated numerical computation
algorithm in OCaml. (Basically, it is a Fast Mulipole Method
accceleration for a Boundary Element Method.) There is a lot of
intermediate results of several different types that are expensive to
calculate but that are used repeatedly in subsequent
calculations. Therefore, I am caching (memoizing) these intermediate
results using a hash table, like this:


  val f : ('a -> 'b) -> ('a -> 'b)

  let memo f = 
    let h = Hashtbl.create 11 in
    fun x -> try 
      Hashtbl.find h x 
    with Not_found -> 
      let r = f x in ( Hashtbl.add h x r; r )

This works fine for smaller problems and the calculation is much
accelerated. However, when the problem gets big and the amount of
memory needed to store the intermediate results exceeds the amount of
available RAM, the operating system (Linux) starts to swap the cache
memory to disk. At this point, it is no longer advantageous to cache
the values, it is actually faster to recalculate them then to retrieve
them from the disk. 

The solution seems to be to implement a forgetting LRU
(last-recently-used) cache of a fixed maximum size. Here come my
questions:

1) Has anyone implemented that in Ocaml already? 
   I have found http://bleu.west.spy.net/~dustin/projects/ocaml/doc/Lru.html
   Do you have any experience with it?


2) I have several different types of objects (functions) to cache and
   it is difficult to find out a priori, how much memory should be
   devoted to each type, if each cache is implemented separetely.  It
   would be nice to be able to make all caches share the same pool but
   storing values of different type (and size) in the same structure
   conflicts with the type safety. Should I use the Obj module to
   circumvent this, or is there a more elegant way around?

3) Many (but not all) of the precomputed values are floats. Does it
   pay off to treat them separately? Is there a way to avoid storing
   them boxed, say in a hash-table?

4) An ideal way would seem to be using the Weak module and the weak hash
   table implemented there. Will that work as expected? From reading
   the documentation I have the impression that the garbage collection
   will probably collect the weak values too early, at first minor
   collection. Is it correct? Is there a way to tune the garbage
   collector to leave the weak values alone as long as the total
   memory usage is below a certain threshold?

Thanks very much in advance for all your thoughts and ideas.

Jan
   
  

-- 
-------------------------------------------------------------------------
Jan Kybic <kybic@ieee.org>                  tel. +420 2 2435 7264
       or <kybic@fel.cvut.cz>,     http://cmp.felk.cvut.cz/~kybic

-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [Caml-list] Q: automatic forgetting cache, module Weak, Gc control
  2004-06-30 11:05 [Caml-list] Q: automatic forgetting cache, module Weak, Gc control Jan Kybic
@ 2004-07-01  8:43 ` Jacques GARRIGUE
  2004-07-01 13:59   ` skaller
  0 siblings, 1 reply; 7+ messages in thread
From: Jacques GARRIGUE @ 2004-07-01  8:43 UTC (permalink / raw)
  To: kybic; +Cc: caml-list

From: Jan Kybic <kybic@fel.cvut.cz>
> 
>         I am implementing a complicated numerical computation
> algorithm in OCaml. (Basically, it is a Fast Mulipole Method
> accceleration for a Boundary Element Method.) There is a lot of
> intermediate results of several different types that are expensive to
> calculate but that are used repeatedly in subsequent
> calculations. Therefore, I am caching (memoizing) these intermediate
> results using a hash table, like this:
[...]
> This works fine for smaller problems and the calculation is much
> accelerated. However, when the problem gets big and the amount of
> memory needed to store the intermediate results exceeds the amount of
> available RAM, the operating system (Linux) starts to swap the cache
> memory to disk. At this point, it is no longer advantageous to cache
> the values, it is actually faster to recalculate them then to retrieve
> them from the disk. 

This seems a typical work for weak hashtables.

> 4) An ideal way would seem to be using the Weak module and the weak hash
>    table implemented there. Will that work as expected? From reading
>    the documentation I have the impression that the garbage collection
>    will probably collect the weak values too early, at first minor
>    collection. Is it correct? Is there a way to tune the garbage
>    collector to leave the weak values alone as long as the total
>    memory usage is below a certain threshold?

I don't think you can tweak it, but at least it seems that only weak
values inside the old generation are collected, i.e. only with the
major garbage collector. I don't know whether this is a feature.

In practice this should mean that your memoized values would stay in
memory long enough to be useful, and that you don't have to care about
memory management. If the above property fails, then you might need
another mechanism. For instance you could check Gc.quick_stat
regularly to see if you've got a major collection.

Jacques Garrigue

-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [Caml-list] Q: automatic forgetting cache, module Weak, Gc control
  2004-07-01  8:43 ` Jacques GARRIGUE
@ 2004-07-01 13:59   ` skaller
  2004-07-01 16:25     ` [Caml-list] " Jan Kybic
  0 siblings, 1 reply; 7+ messages in thread
From: skaller @ 2004-07-01 13:59 UTC (permalink / raw)
  To: Jacques GARRIGUE; +Cc: kybic, caml-list

On Thu, 2004-07-01 at 18:43, Jacques GARRIGUE wrote:
[caching results]

A more labor intensive idea would be to manage the
caches yourself. This has the advantage -- and disadvantage --
that you're totally in control.

One idea -- just use a fixed sized cyclic buffer,
keep usage stats, and manually tune the buffer sizes.

Another more flexible and potentially expensive
technique is to cache everything in a thread-safe
manner, and run a separate 'reaper' thread to purge
the most useless values. This has the advantage that
the cache control is lexically isolated and so 
can be worked on independently. Obviously needing
to use a thread-lock to put values in the cache
will cost big time. 

The big plus here is that you can
actually tune the reaping dynamically .. that is,
whilst the program is actually running.
If you really want to get fancy you could actually
make a second thread that output usage stats on 
a socket, and collect them in some kind of crude
dynamic graph you could watch, so as to understand
the loading better.

I'm NOT recommending this technique over using
Weak/GC tuning, I think you should try that first
since it integrates seamlessly with the actual
memory management system.

-- 
John Skaller, mailto:skaller@users.sf.net
voice: 061-2-9660-0850, 
snail: PO BOX 401 Glebe NSW 2037 Australia
Checkout the Felix programming language http://felix.sf.net



-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners


^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Caml-list] Re: Q: automatic forgetting cache, module Weak, Gc control
  2004-07-01 13:59   ` skaller
@ 2004-07-01 16:25     ` Jan Kybic
  2004-07-01 18:14       ` Alex Baretta
  2004-07-02  8:53       ` Hendrik Tews
  0 siblings, 2 replies; 7+ messages in thread
From: Jan Kybic @ 2004-07-01 16:25 UTC (permalink / raw)
  To: caml-list

[This thread is about how to implement caching (memoizing) of function
values of different type with limited memory.]

> A more labor intensive idea would be to manage the
> caches yourself. This has the advantage -- and disadvantage --
> that you're totally in control.

I tried the weak hashtable from the Weak module. It did not work well,
apparently the values got garbage collected almost immediately.
I do not think there is a way to avoid it.

I also tried the LRU module of Dustin Sallings but it is too slow for me.

> One idea -- just use a fixed sized cyclic buffer,
> keep usage stats, and manually tune the buffer sizes.


I think the right idea is to insert each cached values into two
structures: a weak hashtable so that the value can be found fast, and 
another global FIFO type structure that will start to drop oldest values when
there is not enough memory. For efficiency, the FIFO structure will 
hold blocks (arrays). As the FIFO structure is global and will have to
hold different types of data, storing Obj.t seems to be apropriate.

I also have to deal with the fact that I want to cache different
fucntions that do not take the same effort to evaluate. I plan to
implement a self tuning strategy - when evaluating the function for
the first time, we will measure how much time it takes and then adjust
the block size, so that each block represents about the same
computational effort. 

I am also thinking about other strategies taking into account the
sizes of the produced results. In this case the global structure would
be an ordered set.

The disadvantage of putting the values into two structures is the
memory overhead. Maybe I should avoid using Weak hashtables
alltogether and store the values in a set of normal hashtables,
dropping some of them, if necessary... It is difficult to get the
memory/speed tradeoff right.

> Another more flexible and potentially expensive
> technique is to cache everything in a thread-safe
> manner, and run a separate 'reaper' thread to purge
> the most useless values. This has the advantage that
> the cache control is lexically isolated and so 
> can be worked on independently. Obviously needing
> to use a thread-lock to put values in the cache
> will cost big time. 

Yes, but what is the most useless value and how to find it
efficiently? 

Jan

-- 
-------------------------------------------------------------------------
Jan Kybic <kybic@ieee.org>                  tel. +420 2 2435 7264
       or <kybic@fel.cvut.cz>,     http://cmp.felk.cvut.cz/~kybic

-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [Caml-list] Re: Q: automatic forgetting cache, module Weak, Gc control
  2004-07-01 16:25     ` [Caml-list] " Jan Kybic
@ 2004-07-01 18:14       ` Alex Baretta
  2004-07-02  8:53       ` Hendrik Tews
  1 sibling, 0 replies; 7+ messages in thread
From: Alex Baretta @ 2004-07-01 18:14 UTC (permalink / raw)
  To: Jan Kybic; +Cc: caml-list

Jan Kybic wrote:

> I am also thinking about other strategies taking into account the
> sizes of the produced results. In this case the global structure would
> be an ordered set.
> 
> The disadvantage of putting the values into two structures is the
> memory overhead. Maybe I should avoid using Weak hashtables
> alltogether and store the values in a set of normal hashtables,
> dropping some of them, if necessary... It is difficult to get the
> memory/speed tradeoff right.

You can memoize your results in a generic manner through the use of maps 
which use result hashes as keys. Two results hashing down to the same 
value will not be stored in the same map. The number of significant bits 
used by the hash function determines the size of the map.

Alex

-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [Caml-list] Re: Q: automatic forgetting cache, module Weak, Gc control
  2004-07-01 16:25     ` [Caml-list] " Jan Kybic
  2004-07-01 18:14       ` Alex Baretta
@ 2004-07-02  8:53       ` Hendrik Tews
  2004-07-02  9:40         ` Jan Kybic
  1 sibling, 1 reply; 7+ messages in thread
From: Hendrik Tews @ 2004-07-02  8:53 UTC (permalink / raw)
  To: Jan Kybic; +Cc: caml-list

Jan Kybic <kybic@fel.cvut.cz> writes:

   I think the right idea is to insert each cached values into two
   structures: a weak hashtable so that the value can be found fast, and 
   another global FIFO type structure that will start to drop oldest values when
   there is not enough memory. For efficiency, the FIFO structure will 
   hold blocks (arrays). As the FIFO structure is global and will have to
   hold different types of data, storing Obj.t seems to be apropriate.
   
Why don't you use a variant type?

Bye,

Hendrik

-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [Caml-list] Re: Q: automatic forgetting cache, module Weak, Gc control
  2004-07-02  8:53       ` Hendrik Tews
@ 2004-07-02  9:40         ` Jan Kybic
  0 siblings, 0 replies; 7+ messages in thread
From: Jan Kybic @ 2004-07-02  9:40 UTC (permalink / raw)
  To: caml-list

>    there is not enough memory. For efficiency, the FIFO structure will 
>    hold blocks (arrays). As the FIFO structure is global and will have to
>    hold different types of data, storing Obj.t seems to be apropriate.
>    
> Why don't you use a variant type?

Because I do not know how to generate new variant labels on the fly.
I want to have a (possibly functorized) factory function

memo : ( a -> b ) -> ( a -> b )

such that    'memo f'   gives the same results as  'f', only faster,
as it will cache the results. Now, 'memo' is used many times in the
program, with different functions 'f'. If I store the results in a
variant type then for each of the invocation of 'memo' I need to
generate a different label. I do not know how to do that.

Jan

-- 
-------------------------------------------------------------------------
Jan Kybic <kybic@ieee.org>                  tel. +420 2 2435 7264
       or <kybic@fel.cvut.cz>,     http://cmp.felk.cvut.cz/~kybic

-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners


^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2004-07-02  9:42 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2004-06-30 11:05 [Caml-list] Q: automatic forgetting cache, module Weak, Gc control Jan Kybic
2004-07-01  8:43 ` Jacques GARRIGUE
2004-07-01 13:59   ` skaller
2004-07-01 16:25     ` [Caml-list] " Jan Kybic
2004-07-01 18:14       ` Alex Baretta
2004-07-02  8:53       ` Hendrik Tews
2004-07-02  9:40         ` Jan Kybic

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).