Re: [Caml-list] "super-compaction" of values

caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed

* Re:  [Caml-list] "super-compaction" of values
@ 2001-08-01 13:12 Damien Doligez
  0 siblings, 0 replies; 11+ messages in thread
From: Damien Doligez @ 2001-08-01 13:12 UTC (permalink / raw)
  To: caml-list

>From: Markus Mottl <markus@mail4.ai.univie.ac.at>

>would return "la" unchanged but "lb" would now share the common structure
>with list_a (= la) (also when inserted in any other order).

Look for "hash-consing" in the (LISP) literature.  But doing it for
arbitrary types looks complex.


>  * Support automatic GC? Or should the user make sure that possible
>    administrative memory overhead is prevented by telling "'a t" that
>    some value need not be remembered for sharing with further values
>    anymore.  The "automatic" version is probably less efficient and
>    much more complicated.

Module Weak is here to help you solve this problem.

-- Damien
-------------------
Bug reports: http://caml.inria.fr/bin/caml-bugs  FAQ: http://caml.inria.fr/FAQ/
To unsubscribe, mail caml-list-request@inria.fr  Archives: http://caml.inria.fr


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Caml-list] "super-compaction" of values
  2001-08-02 13:28 ` Gregory Morrisett
@ 2001-08-02 15:56   ` Markus Mottl
  0 siblings, 0 replies; 11+ messages in thread
From: Markus Mottl @ 2001-08-02 15:56 UTC (permalink / raw)
  To: Gregory Morrisett; +Cc: OCAML

On Thu, 02 Aug 2001, Gregory Morrisett wrote:
> Similarly, Zhong Shao's Flint IL uses hash-consing for type terms
> quite effectively.  I can dig up the reference if you like.  In
> these two settings, one has to worry about the exponential blow
> up you can get by turning a DAG into a tree.

This is not a problem for me, because the trees really stay trees (as
datastructure), only that OCaml will represent them as DAGs in memory
after the transformation. Of course, if you want to print such a tree,
you'll have to cut exponentially many trees for the paper ;)

So there are really two kinds of DAGs: the explicitly constructed one that
uses (integer) indices to refer to subtrees, and the "normal" tree, which
uses sharing. Unfortunately, I can't use the sharing-tree representation
with memory addresses as indices alone, because the GC can move things,
which destroys any kind of order relation based on memory addresses to
search for matching subtrees... :(

> In addition, the hash-consing made structural equality tests quite cheap
> (O(1)) which is important for type checking or proof verification.

Well, one has to "hash-cons" a freshly generated tree at least once,
but then one can find equivalent (sub)trees very fast, because they'll be
pointer-equal. If they aren't, they cannot be structurally equivalent. The
algorithm I have in mind shouldn't require more than O(n*log(n)) time
(n being the number of nodes in the tree) for merging a fresh tree into
the hash-consing datastructure. Hm, let's see...

> For TAL, GC was not an issue, though we thought it might be.  For Flint,
> if memory serves, they periodically flushed the table or used some
> finalization/weak-pointer tricks.

In the first attempt I'll probably use "manual" removing of unneeded
trees from the hash-consing datastructure. Maybe then I'll play some
finalization tricks that remove unreachable trees automatically, only
leaving subtrees shared with not yet removed trees.

Thanks to all of you for your numerous answers!

Regards,
Markus Mottl

-- 
Markus Mottl                                             markus@oefai.at
Austrian Research Institute
for Artificial Intelligence                  http://www.oefai.at/~markus
-------------------
Bug reports: http://caml.inria.fr/bin/caml-bugs  FAQ: http://caml.inria.fr/FAQ/
To unsubscribe, mail caml-list-request@inria.fr  Archives: http://caml.inria.fr

^ permalink raw reply	[flat|nested] 11+ messages in thread

* RE: [Caml-list] "super-compaction" of values
@ 2001-08-02 13:46 Dave Berry
  2001-08-02 13:28 ` Gregory Morrisett
  0 siblings, 1 reply; 11+ messages in thread
From: Dave Berry @ 2001-08-02 13:46 UTC (permalink / raw)
  To: Gregory Morrisett, Markus Mottl, John R Harrison; +Cc: caml-list

In a similar vein, way back in the 1980's Kevin Mitchell added
hash-consing to the top-level environment of the Edinburgh ML
interpreter.  This saved a great deal of space.

-----Original Message-----
From: Gregory Morrisett [mailto:jgm@cs.cornell.edu]
Sent: 02 August 2001 14:28
To: Markus Mottl; John R Harrison
Cc: caml-list@inria.fr
Subject: RE: [Caml-list] "super-compaction" of values

> > I ended up writing a naive implementation, without making stuff
> > garbage-collectible, but only using it for structures I knew were
> > persistent. To my chagrin, it turned out that Malcolm was absolutely
> > right. Space usage actually went *up*, presumably because 
> the hashing
> > datastructures were large enough to overwhelm the small amount of
> > sharing.

For the TAL type-checker, we used hash-consing to represent type terms
(which are quite big for TAL) and this had a significantly good impact
on performance.  See:

Dan Grossman and Greg Morrisett.  Scalable Certification for Typed
Assembly Language.  In the 2000 ACM SIGPLAN Workshop on Types in
Compilation, Montreal, Canada, September 2000.
www.cs.cornell.edu/talc/papers/tal_scale.ps

Similarly, Zhong Shao's Flint IL uses hash-consing for type terms
quite effectively.  I can dig up the reference if you like.  In
these two settings, one has to worry about the exponential blow
up you can get by turning a DAG into a tree.  In addition, the
hash-consing made structural equality tests quite cheap (O(1))
which is important for type checking or proof verification.  For
TAL, GC was not an issue, though we thought it might be.  For Flint,
if memory serves, they periodically flushed the table or used
some finalization/weak-pointer tricks.  

And long ago, I remember that Eric Cooper hacked the SML/NJ collector
to do hash-consing on major collections for immutable objects so as
to generically compress the heap.  If memory serves, he got fantastic
reductions in overall space (at the expense of much slower collections.)
Those were the days when SML/NJ did a lot of paging on a standard
workstation...

Another interesting data point is that Scott Nettles and his group
at Penn did some work compressing heap images and found that they
compress remarkably well, which suggests that we waste a lot of space.

-Greg
-------------------
Bug reports: http://caml.inria.fr/bin/caml-bugs  FAQ:
http://caml.inria.fr/FAQ/
To unsubscribe, mail caml-list-request@inria.fr  Archives:
http://caml.inria.fr
-------------------
Bug reports: http://caml.inria.fr/bin/caml-bugs  FAQ: http://caml.inria.fr/FAQ/
To unsubscribe, mail caml-list-request@inria.fr  Archives: http://caml.inria.fr

^ permalink raw reply	[flat|nested] 11+ messages in thread

* RE: [Caml-list] "super-compaction" of values
@ 2001-08-02 13:28 ` Gregory Morrisett
  2001-08-02 15:56   ` Markus Mottl
  0 siblings, 1 reply; 11+ messages in thread
From: Gregory Morrisett @ 2001-08-02 13:28 UTC (permalink / raw)
  To: Markus Mottl, John R Harrison; +Cc: caml-list

> > I ended up writing a naive implementation, without making stuff
> > garbage-collectible, but only using it for structures I knew were
> > persistent. To my chagrin, it turned out that Malcolm was absolutely
> > right. Space usage actually went *up*, presumably because 
> the hashing
> > datastructures were large enough to overwhelm the small amount of
> > sharing.

For the TAL type-checker, we used hash-consing to represent type terms
(which are quite big for TAL) and this had a significantly good impact
on performance.  See:

Dan Grossman and Greg Morrisett.  Scalable Certification for Typed
Assembly Language.  In the 2000 ACM SIGPLAN Workshop on Types in
Compilation, Montreal, Canada, September 2000.
www.cs.cornell.edu/talc/papers/tal_scale.ps

Similarly, Zhong Shao's Flint IL uses hash-consing for type terms
quite effectively.  I can dig up the reference if you like.  In
these two settings, one has to worry about the exponential blow
up you can get by turning a DAG into a tree.  In addition, the
hash-consing made structural equality tests quite cheap (O(1))
which is important for type checking or proof verification.  For
TAL, GC was not an issue, though we thought it might be.  For Flint,
if memory serves, they periodically flushed the table or used
some finalization/weak-pointer tricks.  

And long ago, I remember that Eric Cooper hacked the SML/NJ collector
to do hash-consing on major collections for immutable objects so as
to generically compress the heap.  If memory serves, he got fantastic
reductions in overall space (at the expense of much slower collections.)
Those were the days when SML/NJ did a lot of paging on a standard
workstation...

Another interesting data point is that Scott Nettles and his group
at Penn did some work compressing heap images and found that they
compress remarkably well, which suggests that we waste a lot of space.

-Greg
-------------------
Bug reports: http://caml.inria.fr/bin/caml-bugs  FAQ: http://caml.inria.fr/FAQ/
To unsubscribe, mail caml-list-request@inria.fr  Archives: http://caml.inria.fr

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Caml-list] "super-compaction" of values
       [not found] <200108011606.JAA29019@dhpc0010.pdx.intel.com>
@ 2001-08-02  9:42 ` Markus Mottl
  0 siblings, 0 replies; 11+ messages in thread
From: Markus Mottl @ 2001-08-02  9:42 UTC (permalink / raw)
  To: John R Harrison; +Cc: caml-list

On Wed, 01 Aug 2001, John R Harrison wrote:
> About 5 years ago I became very enthusiastic about using hash-consing in
> the HOL Light prover to share common subterms of theorems. I was
> convinced it would cut space usage dramatically.

Well, it's actually not so much space savings that I am interested in,
but mostly exploiting (sub-)program equivalences to speed up expensive
computations.

> I ended up writing a naive implementation, without making stuff
> garbage-collectible, but only using it for structures I knew were
> persistent. To my chagrin, it turned out that Malcolm was absolutely
> right. Space usage actually went *up*, presumably because the hashing
> datastructures were large enough to overwhelm the small amount of
> sharing.

I can well imagine this. In my case it's not "real" programming languages
that I have to deal with, but rather quite confined ones, and there are
likely to be hundreds of not-so-large programs that mostly differ by a
few subtrees only (= lots of opportunities for sharing). It seems worth
a try...

Regards,
Markus Mottl

-- 
Markus Mottl                                             markus@oefai.at
Austrian Research Institute
for Artificial Intelligence                  http://www.oefai.at/~markus
-------------------
Bug reports: http://caml.inria.fr/bin/caml-bugs  FAQ: http://caml.inria.fr/FAQ/
To unsubscribe, mail caml-list-request@inria.fr  Archives: http://caml.inria.fr


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Caml-list] "super-compaction" of values
  2001-08-01 10:20 Markus Mottl
  2001-08-01 13:41 ` Thorsten Ohl
@ 2001-08-02  1:48 ` Alexander V. Voinov
  1 sibling, 0 replies; 11+ messages in thread
From: Alexander V. Voinov @ 2001-08-02  1:48 UTC (permalink / raw)
  To: Markus Mottl; +Cc: OCAML

Hi Markus

Markus Mottl wrote:
> would return "la" unchanged but "lb" would now share the common structure
> with list_a (= la) (also when inserted in any other order).

It may not be applicable here, but I recall that in Prolog there was a
long argument between structure sharing and structure copying, and the
latter won with new cheaper memories.

Alexander
-------------------
Bug reports: http://caml.inria.fr/bin/caml-bugs  FAQ: http://caml.inria.fr/FAQ/
To unsubscribe, mail caml-list-request@inria.fr  Archives: http://caml.inria.fr


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Caml-list] "super-compaction" of values
  2001-08-01 14:29     ` Thorsten Ohl
@ 2001-08-01 15:25       ` Markus Mottl
  0 siblings, 0 replies; 11+ messages in thread
From: Markus Mottl @ 2001-08-01 15:25 UTC (permalink / raw)
  To: Thorsten Ohl; +Cc: caml-list

Thorsten Ohl schrieb am Mittwoch, den 01. August 2001:
> This implies that you either have a good comparison function for your
> (sub-)trees or a unique representation or that you can live with a
> small fraction of program trees that are not recognized as equivalent.

Comparing program trees for equivalence is straightforward. Their concrete
representation is:

  type tree = Tree of int * tree array | Leaf of int

"Tree (1, [| Leaf 2 |])" would read "from the current nonterminal (e.g.
start symbol), choose production 1, which contains a terminal in state 2".

Of course, semantic equivalence doesn't imply syntactic equivalence.
Exploitation of semantic equivalences is handled by preprocessing programs
with user specified rewrite rules to get normal forms. The latter can
then be exploited with syntactic equivalences again.

> In this case you can represent trees as maps from parent nodes to sets
> of child nodes and you get a DAG representation with automatically
> shared common subexpressions for free.

This is the non-generic solution I currently have in mind. I'd only
have to convert between my representation and this one so it shouldn't
be terribly difficult.

> A different solution could probably be build on a variation of Chris
> Okasaki's Tries of Trees. [I have only a bare bones O'Caml version,
> but the usual tricks for working around the lack of polymorphic
> recursion work.]

Yes, I remember this chapter (though I currently don't have the book
at hand). I'll probably take another look at it, maybe it inspires me
to implement a solution I like. I had hoped that somebody might have
already implemented a generic solution, but so I'll have to continue
experimenting...

Thanks a lot for the help!

Regards,
Markus Mottl

-- 
Markus Mottl                                             markus@oefai.at
Austrian Research Institute
for Artificial Intelligence                  http://www.oefai.at/~markus
-------------------
Bug reports: http://caml.inria.fr/bin/caml-bugs  FAQ: http://caml.inria.fr/FAQ/
To unsubscribe, mail caml-list-request@inria.fr  Archives: http://caml.inria.fr

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Caml-list] "super-compaction" of values
  2001-08-01 14:09   ` Markus Mottl
@ 2001-08-01 14:29     ` Thorsten Ohl
  2001-08-01 15:25       ` Markus Mottl
  0 siblings, 1 reply; 11+ messages in thread
From: Thorsten Ohl @ 2001-08-01 14:29 UTC (permalink / raw)
  To: caml-list

Markus Mottl writes:

> I generate programs (abstract syntax trees) almost randomly. It can
> happen very often that generated programs have already been tried or
> share large subtrees with already known solutions. The programs are
> usually not terribly large, but evaluating them is extremely
> expensive.  Therefore, I would like to prevent it at all if the
> meaning is already known or at least speed it up significantly by
> reusing the meaning of subtrees (the programs are referentially
> transparent, of course).

This implies that you either have a good comparison function for your
(sub-)trees or a unique representation or that you can live with a
small fraction of program trees that are not recognized as equivalent.
In this case you can represent trees as maps from parent nodes to sets
of child nodes and you get a DAG representation with automatically
shared common subexpressions for free.

A different solution could probably be build on a variation of Chris
Okasaki's Tries of Trees. [I have only a bare bones O'Caml version,
but the usual tricks for working around the lack of polymorphic
recursion work.]
-- 
Thorsten Ohl, Physics Department, TU Darmstadt -- ohl@hep.tu-darmstadt.de
http://heplix.ikp.physik.tu-darmstadt.de/~ohl/ [<=== PGP public key here]
-------------------
Bug reports: http://caml.inria.fr/bin/caml-bugs  FAQ: http://caml.inria.fr/FAQ/
To unsubscribe, mail caml-list-request@inria.fr  Archives: http://caml.inria.fr


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Caml-list] "super-compaction" of values
  2001-08-01 13:41 ` Thorsten Ohl
@ 2001-08-01 14:09   ` Markus Mottl
  2001-08-01 14:29     ` Thorsten Ohl
  0 siblings, 1 reply; 11+ messages in thread
From: Markus Mottl @ 2001-08-01 14:09 UTC (permalink / raw)
  To: Thorsten Ohl; +Cc: caml-list

Thorsten Ohl schrieb am Mittwoch, den 01. August 2001:
> In my experience, one can beat the combinatorial complexity only if one
> understands the genesis of such values with high sharing potential.
> Then one can construct algorithms that start from building all
> possible combinations in a highly condensed representation and select
> the interesting ones later.  This will be much more efficient than
> building only the interesting ones and searching for shared nodes later.

My requirements are a bit different: I generate programs (abstract
syntax trees) almost randomly. It can happen very often that generated
programs have already been tried or share large subtrees with already
known solutions. The programs are usually not terribly large, but
evaluating them is extremely expensive.  Therefore, I would like to
prevent it at all if the meaning is already known or at least speed it
up significantly by reusing the meaning of subtrees (the programs are
referentially transparent, of course).

Merging the programs into a shared datastructure would allow me to quickly
find out about (syntactic) equivalences of whole programs or their parts.

The three options I currently have:

  * Write a specialized implemention for program trees. Would probably be
    most efficient, because I can exploit some of their properties.

  * Write a generic implementation, which lets the user provide decompose-
    and comparison functions for his types. The latter puts some work
    on the user, but is otherwise generic and elegant.

  * Write a fully generic implementation that works on any type by
    traversing the low-level representation of OCaml-values. Somewhat
    messy to implement and may break when representations change...

A difficult question how to proceed best. Considering changes to
datastructures, a generic solution would be fine. This would also allow me
(and you :) to reuse it...

Regards,
Markus Mottl

-- 
Markus Mottl                                             markus@oefai.at
Austrian Research Institute
for Artificial Intelligence                  http://www.oefai.at/~markus
-------------------
Bug reports: http://caml.inria.fr/bin/caml-bugs  FAQ: http://caml.inria.fr/FAQ/
To unsubscribe, mail caml-list-request@inria.fr  Archives: http://caml.inria.fr

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [Caml-list] "super-compaction" of values
  2001-08-01 10:20 Markus Mottl
@ 2001-08-01 13:41 ` Thorsten Ohl
  2001-08-01 14:09   ` Markus Mottl
  2001-08-02  1:48 ` Alexander V. Voinov
  1 sibling, 1 reply; 11+ messages in thread
From: Thorsten Ohl @ 2001-08-01 13:41 UTC (permalink / raw)
  To: caml-list

Markus Mottl writes:

> has anybody already attempted to implement some kind of
> "super-compaction" [...] If this worked for arbitrary
> datastructures, one could exploit this to produce very compact
> representations of values by maximizing the amount of sharing
> between them.

I'm sure that one can not beat the combinatorial complexity of the
general case.  For example, I work with algebraic expressions that
grow in size like (2n-1)!! = (2n-1)(2n-3)(2n-5)... if represented as
trees, but can be rewritten as DAGs of size 10**(n/2) [see
http://heplix.ikp.physik.tu-darmstadt.de/~ohl/omega/ and in particular
the chapter Directed Acyclical Graphs in
http://heplix.ikp.physik.tu-darmstadt.de/~ohl/omega/omega.pdf].

In my experience, one can beat the combinatorial complexity only if
one understands the genesis of such values with high sharing
potential.  Then one can construct algorithms that start from building
all possible combinations in a highly condensed representation and
select the interesting ones later.  This will be much more efficient
than building only the interesting ones and searching for shared nodes
later.

But this is never a general solution, because it requires a good
unterstanding of the structure of possible terms.
-- 
Thorsten Ohl, Physics Department, TU Darmstadt -- ohl@hep.tu-darmstadt.de
http://heplix.ikp.physik.tu-darmstadt.de/~ohl/ [<=== PGP public key here]
-------------------
Bug reports: http://caml.inria.fr/bin/caml-bugs  FAQ: http://caml.inria.fr/FAQ/
To unsubscribe, mail caml-list-request@inria.fr  Archives: http://caml.inria.fr

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [Caml-list] "super-compaction" of values
@ 2001-08-01 10:20 Markus Mottl
  2001-08-01 13:41 ` Thorsten Ohl
  2001-08-02  1:48 ` Alexander V. Voinov
  0 siblings, 2 replies; 11+ messages in thread
From: Markus Mottl @ 2001-08-01 10:20 UTC (permalink / raw)
  To: OCAML

Hello,

has anybody already attempted to implement some kind of "super-compaction"
of arbitrary values, e.g.:

  let list_a = [1, 2, 3, 4]
  let list_b =       [3, 4]

Both are in completely distinct parts of the memory. What I'd find
interesting is a module like:

  type 'a t

  val empty : unit -> 'a t
  val add : 'a -> 'a t -> 'a

which, when applied like this:

  let t = empty () in
  let la = add list_a t in
  let lb = add list_b t in
  ...

would return "la" unchanged but "lb" would now share the common structure
with list_a (= la) (also when inserted in any other order).

If this worked for arbitrary datastructures, one could exploit this to
produce very compact representations of values by maximizing the amount
of sharing between them.

One could use the "Obj"-module for this purpose. Problems, however, are:

  * Probably not useful for cyclic datastructures (too inefficient), but
    support of acyclic structures only would be ok.

  * Support automatic GC? Or should the user make sure that possible
    administrative memory overhead is prevented by telling "'a t" that
    some value need not be remembered for sharing with further values
    anymore.  The "automatic" version is probably less efficient and
    much more complicated.

An extreme form of this implementation could even have functions +
types like:

  val add : 'a -> t -> 'a

This would even allow sharing between representations of different types.

A correct and efficient implementation is probably everything else but
trivial. If anybody has already tried this crazy idea, I'd be interested
to know!

Regards,
Markus Mottl

-- 
Markus Mottl                                             markus@oefai.at
Austrian Research Institute
for Artificial Intelligence                  http://www.oefai.at/~markus
-------------------
Bug reports: http://caml.inria.fr/bin/caml-bugs  FAQ: http://caml.inria.fr/FAQ/
To unsubscribe, mail caml-list-request@inria.fr  Archives: http://caml.inria.fr

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2001-08-02 15:56 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2001-08-01 13:12 [Caml-list] "super-compaction" of values Damien Doligez
  -- strict thread matches above, loose matches on Subject: below --
2001-08-02 13:46 Dave Berry
2001-08-02 13:28 ` Gregory Morrisett
2001-08-02 15:56   ` Markus Mottl
     [not found] <200108011606.JAA29019@dhpc0010.pdx.intel.com>
2001-08-02  9:42 ` Markus Mottl
2001-08-01 10:20 Markus Mottl
2001-08-01 13:41 ` Thorsten Ohl
2001-08-01 14:09   ` Markus Mottl
2001-08-01 14:29     ` Thorsten Ohl
2001-08-01 15:25       ` Markus Mottl
2001-08-02  1:48 ` Alexander V. Voinov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).