caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed
* caml_oldify_local_roots takes 50% of the total runtime
@ 2006-10-25 15:49 Hendrik Tews
  2006-10-25 18:21 ` [Caml-list] " Markus Mottl
  2006-12-15 10:56 ` Hendrik Tews
  0 siblings, 2 replies; 13+ messages in thread
From: Hendrik Tews @ 2006-10-25 15:49 UTC (permalink / raw)
  To: caml-list

Hi,

I have an application (olmar[1]) for gprof tells me 

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 58.33      9.52     9.52                             caml_oldify_local_roots

that is more than half of the time is spent in
caml_oldify_local_roots. The trouble is, there is no significant
garbage to collect. If I could, I would switch off garbage
collection completely.

The ocaml part of the application takes a big tree (353040 nodes
in the profiled case) in C++ and translates it into ocaml. Some
nodes are shared in the tree and there are also circles,
therefore, every C++ node has caml_global_root for its ocaml
relative (so 350.000 caml_global_roots). When the ocaml tree is
ready it is marshaled to disk and the application exits (no need
to collect garbage).

To improve the runtime I set space_overhead to max_int, but this
didn't change much (profiling above is with max_int). Apparently,
space_overhead doesn't have really have an influence on
allocate-only programs. Finally I increase minor_heap_size until
the minor collections came down to one. This brought the ocaml
running time down from 5.7s to 1.3s. 

Is there a way to adopt the size of minor heap to the program
behaviour? (BTW 32K default minor heap sounds like 1995, not like
2006) 

>From the comments of Damien Doligez in a related thread in
http://caml.inria.fr/pub/ml-archives/caml-list/2004/07/84cd291931627c13faf56a259026885c.en.html
I got the impression that there is extra punishment for
caml_local_roots. Should the situation improve if I organize the
350.000 ocaml node pointers on the ocaml side via one local root
(asumming this information is needed not so often)?


Bye,

Hendrik


[1] http://www.cs.ru.nl/~tews/olmar/


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Caml-list] caml_oldify_local_roots takes 50% of the total runtime
  2006-10-25 15:49 caml_oldify_local_roots takes 50% of the total runtime Hendrik Tews
@ 2006-10-25 18:21 ` Markus Mottl
  2006-10-25 19:01   ` skaller
  2006-10-26  9:47   ` Hendrik Tews
  2006-12-15 10:56 ` Hendrik Tews
  1 sibling, 2 replies; 13+ messages in thread
From: Markus Mottl @ 2006-10-25 18:21 UTC (permalink / raw)
  To: Hendrik Tews; +Cc: caml-list

[-- Attachment #1: Type: text/plain, Size: 1323 bytes --]

Hi,

On 10/25/06, Hendrik Tews <H.Tews@cs.ru.nl> wrote:
>
> >From the comments of Damien Doligez in a related thread in
>
> http://caml.inria.fr/pub/ml-archives/caml-list/2004/07/84cd291931627c13faf56a259026885c.en.html
> I got the impression that there is extra punishment for
> caml_local_roots. Should the situation improve if I organize the
> 350.000 ocaml node pointers on the ocaml side via one local root
> (asumming this information is needed not so often)?
>

This is a known problem: the OCaml-runtime calls this function
(caml_oldify_local_roots) at each minor collection.  The function iterates
over all local roots, and since minor collections happen quite often, this
naturally leads to excessive CPU-usage - even if hardly anything else is
going on.

To work around this problem you should store pointers to the C++-objects on
the C++-side, and e.g. associate them with finalized OCaml-values, or
handles which allow you to explicitly deallocate the objects.

It would be great if the GC could be improved in situations where there are
many local roots.  This is a pretty common problem when interfacing
non-trivial third party libraries, and the clumsy workarounds require
writing somewhat error-prone code.

Best regards,
Markus

-- 
Markus Mottl        http://www.ocaml.info        markus.mottl@gmail.com

[-- Attachment #2: Type: text/html, Size: 1911 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Caml-list] caml_oldify_local_roots takes 50% of the total runtime
  2006-10-25 18:21 ` [Caml-list] " Markus Mottl
@ 2006-10-25 19:01   ` skaller
  2006-10-26  9:50     ` Hendrik Tews
  2006-10-26 13:48     ` Gerd Stolpmann
  2006-10-26  9:47   ` Hendrik Tews
  1 sibling, 2 replies; 13+ messages in thread
From: skaller @ 2006-10-25 19:01 UTC (permalink / raw)
  To: Markus Mottl; +Cc: Hendrik Tews, caml-list

On Wed, 2006-10-25 at 14:21 -0400, Markus Mottl wrote:

> It would be great if the GC could be improved in situations where
> there are many local roots.  This is a pretty common problem when
> interfacing non-trivial third party libraries, and the clumsy
> workarounds require writing somewhat error-prone code. 

Just curious .. but why is there *ever* any need for more than
one root? If you need to add a root, why not just add to
an already rooted data structure? Would that solve the problem?

-- 
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Caml-list] caml_oldify_local_roots takes 50% of the total runtime
  2006-10-25 18:21 ` [Caml-list] " Markus Mottl
  2006-10-25 19:01   ` skaller
@ 2006-10-26  9:47   ` Hendrik Tews
  2006-10-26 14:54     ` Markus Mottl
  1 sibling, 1 reply; 13+ messages in thread
From: Hendrik Tews @ 2006-10-26  9:47 UTC (permalink / raw)
  To: caml-list

"Markus Mottl" <markus.mottl@gmail.com> writes:

   To work around this problem you should store pointers to the C++-objects on
   the C++-side, and e.g. associate them with finalized OCaml-values, or
   handles which allow you to explicitly deallocate the objects.

I don't quite understand: I only have pointers from C++ to ocaml.
Once constructed the ocaml objects are completely independent
from the C++ ones.

Hendrik


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Caml-list] caml_oldify_local_roots takes 50% of the total runtime
  2006-10-25 19:01   ` skaller
@ 2006-10-26  9:50     ` Hendrik Tews
  2006-10-26 13:48     ` Gerd Stolpmann
  1 sibling, 0 replies; 13+ messages in thread
From: Hendrik Tews @ 2006-10-26  9:50 UTC (permalink / raw)
  To: caml-list

skaller <skaller@users.sourceforge.net> writes:

   On Wed, 2006-10-25 at 14:21 -0400, Markus Mottl wrote:

   > It would be great if the GC could be improved in situations where
   > there are many local roots.  This is a pretty common problem when
   > interfacing non-trivial third party libraries, and the clumsy
   > workarounds require writing somewhat error-prone code. 

   Just curious .. but why is there *ever* any need for more than
   one root? If you need to add a root, why not just add to
   an already rooted data structure? Would that solve the problem?

Sure, one should be enough. However having many roots is much
more convenient.

Hendrik


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Caml-list] caml_oldify_local_roots takes 50% of the total runtime
  2006-10-25 19:01   ` skaller
  2006-10-26  9:50     ` Hendrik Tews
@ 2006-10-26 13:48     ` Gerd Stolpmann
  2006-10-27 11:36       ` Hendrik Tews
  1 sibling, 1 reply; 13+ messages in thread
From: Gerd Stolpmann @ 2006-10-26 13:48 UTC (permalink / raw)
  To: skaller; +Cc: Markus Mottl, Hendrik Tews, caml-list

Am Donnerstag, den 26.10.2006, 05:01 +1000 schrieb skaller:
> On Wed, 2006-10-25 at 14:21 -0400, Markus Mottl wrote:
> 
> > It would be great if the GC could be improved in situations where
> > there are many local roots.  This is a pretty common problem when
> > interfacing non-trivial third party libraries, and the clumsy
> > workarounds require writing somewhat error-prone code. 
> 
> Just curious .. but why is there *ever* any need for more than
> one root? If you need to add a root, why not just add to
> an already rooted data structure? Would that solve the problem?
> 

Yes, that solves it. The point is that a normal data structure (e.g.
hash table, ...) can be moved to the major heap, so it is only traversed
for each major collection.

Note also that there are local and global roots. You need local roots
for every pointer that is _temporarily_ outside the heap, so yes, you
need usually a lot of them. Pointers on the stack are local roots, too.

Global roots are for pointers that are outside the heap for a long time.
I think these could be handled more efficiently by the GC, e.g. by
moving global roots to a second list when the object referenced by the
pointer is moved to the major heap.

Gerd
-- 
------------------------------------------------------------
Gerd Stolpmann * Viktoriastr. 45 * 64293 Darmstadt * Germany 
gerd@gerd-stolpmann.de          http://www.gerd-stolpmann.de
Phone: +49-6151-153855                  Fax: +49-6151-997714
------------------------------------------------------------


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Caml-list] caml_oldify_local_roots takes 50% of the total runtime
  2006-10-26  9:47   ` Hendrik Tews
@ 2006-10-26 14:54     ` Markus Mottl
  0 siblings, 0 replies; 13+ messages in thread
From: Markus Mottl @ 2006-10-26 14:54 UTC (permalink / raw)
  To: Hendrik Tews; +Cc: caml-list

[-- Attachment #1: Type: text/plain, Size: 1061 bytes --]

On 10/26/06, Hendrik Tews <H.Tews@cs.ru.nl> wrote:
>
> I don't quite understand: I only have pointers from C++ to ocaml.
> Once constructed the ocaml objects are completely independent
> from the C++ ones.
>

Ok, I thought they were entangled, i.e. there were pointers in both
directions.  But you will still need to avoid storing OCaml-values in
C++-values.

If you want to prevent them from being reclaimed as long as the C++-objects
still refer to them, you'll have to store them in e.g. an OCaml-hashtable
using e.g. integer keys as handles.  Then you can store the key in the
C++-object as an ordinary C++-value (i.e. no registering with the GC).  If
you need the OCaml-value on the C++-side, you just convert the handle to an
OCaml-value, and execute a callback to look up the value associated with the
handle in the hashtable.  Btw., if you use an OCaml-int as handle you can
store the "value" directly, because ints need not be registered with the GC
anyway.

Regards,
Markus

-- 
Markus Mottl        http://www.ocaml.info        markus.mottl@gmail.com

[-- Attachment #2: Type: text/html, Size: 1553 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Caml-list] caml_oldify_local_roots takes 50% of the total runtime
  2006-10-26 13:48     ` Gerd Stolpmann
@ 2006-10-27 11:36       ` Hendrik Tews
  2006-10-27 13:17         ` Brian Hurt
  0 siblings, 1 reply; 13+ messages in thread
From: Hendrik Tews @ 2006-10-27 11:36 UTC (permalink / raw)
  To: Gerd Stolpmann; +Cc: skaller, Markus Mottl, Hendrik Tews, caml-list

Gerd Stolpmann <info@gerd-stolpmann.de> writes:

   Global roots are for pointers that are outside the heap for a long time.
   I think these could be handled more efficiently by the GC, e.g. by
   moving global roots to a second list when the object referenced by the
   pointer is moved to the major heap.

I destilled this thread into two feature wishs: 4145, 4146.

Hendrik


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Caml-list] caml_oldify_local_roots takes 50% of the total runtime
  2006-10-27 11:36       ` Hendrik Tews
@ 2006-10-27 13:17         ` Brian Hurt
  2006-10-27 20:05           ` Robert Roessler
                             ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Brian Hurt @ 2006-10-27 13:17 UTC (permalink / raw)
  To: Hendrik Tews; +Cc: caml-list

Hendrik Tews wrote:

>Gerd Stolpmann <info@gerd-stolpmann.de> writes:
>
>   Global roots are for pointers that are outside the heap for a long time.
>   I think these could be handled more efficiently by the GC, e.g. by
>   moving global roots to a second list when the object referenced by the
>   pointer is moved to the major heap.
>
>I destilled this thread into two feature wishs: 4145, 4146.
>
>Hendrik
>
>_______________________________________________
>Caml-list mailing list. Subscription management:
>http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list
>Archives: http://caml.inria.fr
>Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
>Bug reports: http://caml.inria.fr/bin/caml-bugs
>
>
>  
>
With respect to 4146 (minor heap size adjusts to memory size)- I'm not 
sure this is a good idea.  32K is small enough to fit into L1 cache with 
space left over on pretty much all systems these days (64K L1 cache 
seems to be standard).  Having the minor heap small enough to fit into 
L1 cache, and thus having the minor heap live in L1 cache, seems to me 
to be a big advantage.  If for no other reason than it makes minor 
collections a lot faster- the collector doesn't have to fault memory 
into cache, it's already there.

If anything, I'd be inclined to add more generations.  I haven't done 
any timings, but my first guess for reasonable values would be: Gen1 
(youngest): 32K (fits into L1 cache), Gen2: 256K (fits into L2 cache), 
Gen3: 8M (fits into memory), Gen4 (oldest): as large as necessary.

Brian


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Caml-list] caml_oldify_local_roots takes 50% of the total runtime
  2006-10-27 13:17         ` Brian Hurt
@ 2006-10-27 20:05           ` Robert Roessler
  2006-10-27 21:16           ` Pierre Etchemaïté
  2006-10-30  7:50           ` Hendrik Tews
  2 siblings, 0 replies; 13+ messages in thread
From: Robert Roessler @ 2006-10-27 20:05 UTC (permalink / raw)
  To: Caml-list

Brian Hurt wrote:
> With respect to 4146 (minor heap size adjusts to memory size)- I'm not 
> sure this is a good idea.  32K is small enough to fit into L1 cache with 
> space left over on pretty much all systems these days (64K L1 cache 
> seems to be standard).  Having the minor heap small enough to fit into 
> L1 cache, and thus having the minor heap live in L1 cache, seems to me 
> to be a big advantage.  If for no other reason than it makes minor 
> collections a lot faster- the collector doesn't have to fault memory 
> into cache, it's already there.
> 
> If anything, I'd be inclined to add more generations.  I haven't done 
> any timings, but my first guess for reasonable values would be: Gen1 
> (youngest): 32K (fits into L1 cache), Gen2: 256K (fits into L2 cache), 
> Gen3: 8M (fits into memory), Gen4 (oldest): as large as necessary.

I have not yet looked at the details of OCaml's memory management, but 
in my commercial Smalltalk-80 implementation, I used a simple scheme:

Twin 32K "young" spaces... the allocations come from one of these 
(like OCaml's, using trivial and fast code) until it is exhausted. 
Then all of the reachable [young] objects are copied to the other 
[empty] "young" space.  Eventually, a more or less conventional 
mark-and-sweep is performed on the "old" space.

Adjustable parameters control 1) how many times an object ping-pongs 
between the two "young" spaces before it is promoted to the "old" 
space (4 was a reasonable number), and 2) what size an allocation 
needs to be before it is created as "old" (this somewhat controls the 
consumption of the "young" spaces at the expense of some improper 
"old" promotions).

This approach allowed an 8MHz 286 to perform at close to Xerox 
"Dolphin" speeds, while using [typically] less than 1% of total time 
for memory management.

Humorously, the 32K was chosen not because it would fit into cache 
(there wasn't any), but because it would allow both "young" spaces to 
fit into a single x86 64K "segment" - remember those? :)

Robert Roessler
robertr@rftp.com
http://www.rftp.com


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Caml-list] caml_oldify_local_roots takes 50% of the total runtime
  2006-10-27 13:17         ` Brian Hurt
  2006-10-27 20:05           ` Robert Roessler
@ 2006-10-27 21:16           ` Pierre Etchemaïté
  2006-10-30  7:50           ` Hendrik Tews
  2 siblings, 0 replies; 13+ messages in thread
From: Pierre Etchemaïté @ 2006-10-27 21:16 UTC (permalink / raw)
  To: caml-list

Le Fri, 27 Oct 2006 09:17:03 -0400, Brian Hurt <bhurt@janestcapital.com> a écrit :

> With respect to 4146 (minor heap size adjusts to memory size)- I'm not 
> sure this is a good idea.  32K is small enough to fit into L1 cache with 
> space left over on pretty much all systems these days (64K L1 cache 
> seems to be standard).

Default size is 32K *words*, hence 128kB for 32bit archs and 256kB on 64bits.
Small enough to stay in L2 cache, but not L1.

Also, from reading 
http://citeseer.ist.psu.edu/goncalves94cache.html

It seems that write-validate cache policy makes minor heaps larger than
L2 cache a non-issue for most programs.

That paper is 10 years old, did that policy become the norm since then ?


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Caml-list] caml_oldify_local_roots takes 50% of the total runtime
  2006-10-27 13:17         ` Brian Hurt
  2006-10-27 20:05           ` Robert Roessler
  2006-10-27 21:16           ` Pierre Etchemaïté
@ 2006-10-30  7:50           ` Hendrik Tews
  2 siblings, 0 replies; 13+ messages in thread
From: Hendrik Tews @ 2006-10-30  7:50 UTC (permalink / raw)
  To: caml-list

Brian Hurt <bhurt@janestcapital.com> writes:

   With respect to 4146 (minor heap size adjusts to memory size)- I'm not
   sure this is a good idea.  

I didn't think at all about cache size. For some applications it
might certainly be a good idea to keep the minor heap close to
the cache size. I was mainly concerned about applications that
consume a lot of data, which is not going to fit into cache
anyway. If, in addition, they are short running it is much better
to have a huge minor heap and let the OS collect the garbage
afterwards.

Maybe it would be nice to let the application configure the cache
policy:

- eager: fill all free memory before starting GC
- fit_cache: Keep the minor heap close to the cache size

Hendrik


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Caml-list] caml_oldify_local_roots takes 50% of the total runtime
  2006-10-25 15:49 caml_oldify_local_roots takes 50% of the total runtime Hendrik Tews
  2006-10-25 18:21 ` [Caml-list] " Markus Mottl
@ 2006-12-15 10:56 ` Hendrik Tews
  1 sibling, 0 replies; 13+ messages in thread
From: Hendrik Tews @ 2006-12-15 10:56 UTC (permalink / raw)
  To: Hendrik Tews; +Cc: caml-list

Hi,

issue 4145 (http://caml.inria.fr/mantis/view.php?id=4145) has
been closed in the meantime because

    doligez - 2006-11-15 13:19  
   ---------------------------------------------------------------------- 
   Unfortunately, we cannot know which global C roots point to the minor heap,
   since we don't have a write barrier on C variables.

I find this reason not satisfying at all. If a write barrier is
needed, then one could easily stipulate one by providing a
suitable macro CAML_modify_global_root.

With the above line of reasoning it should not be allowed to hold
any ocaml value in a C variable, because a write barrier is
needed for objects that are not in the young generation.

Hendrik


^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2006-12-15 10:56 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-10-25 15:49 caml_oldify_local_roots takes 50% of the total runtime Hendrik Tews
2006-10-25 18:21 ` [Caml-list] " Markus Mottl
2006-10-25 19:01   ` skaller
2006-10-26  9:50     ` Hendrik Tews
2006-10-26 13:48     ` Gerd Stolpmann
2006-10-27 11:36       ` Hendrik Tews
2006-10-27 13:17         ` Brian Hurt
2006-10-27 20:05           ` Robert Roessler
2006-10-27 21:16           ` Pierre Etchemaïté
2006-10-30  7:50           ` Hendrik Tews
2006-10-26  9:47   ` Hendrik Tews
2006-10-26 14:54     ` Markus Mottl
2006-12-15 10:56 ` Hendrik Tews

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).