caml-list - the Caml user's mailing list
* Average cost of the OCaml GC
@ 2010-11-11  3:59 Jianzhou Zhao
  2010-11-11  9:08 ` [Caml-list] " Goswin von Brederlow
  0 siblings, 1 reply; 8+ messages in thread
From: Jianzhou Zhao @ 2010-11-11  3:59 UTC
  To: caml-list

Hi,

What is the average cost of the OCaml GC? According to Callgrind, a
profiling tool in Valgrind, my program spends 57% of its total
execution time in 'mark_slice' and 21% in 'sweep_slice'. Both figures
are the 'Self Cost' (the cost of the function itself) rather than the
'Inclusive Cost' (which includes all called functions). I guess
'mark_slice' and 'sweep_slice' are functions from the OCaml GC. Are
these numbers normal?

My program mixes OCaml and C and passes C data types between the two.
I also wonder whether I defined the interface in an inefficient way
that slows down the GC. Are there any rules to keep in mind to make
the GC work more efficiently?

Thanks.
-- 
Jianzhou



* Re: [Caml-list] Average cost of the OCaml GC
  2010-11-11  3:59 Average cost of the OCaml GC Jianzhou Zhao
@ 2010-11-11  9:08 ` Goswin von Brederlow
  2010-11-11 13:52   ` Jianzhou Zhao
  0 siblings, 1 reply; 8+ messages in thread
From: Goswin von Brederlow @ 2010-11-11  9:08 UTC
  To: Jianzhou Zhao; +Cc: caml-list

Jianzhou Zhao <jianzhou@seas.upenn.edu> writes:

> Hi,
>
> What is the average cost of the OCaml GC? According to Callgrind, a
> profiling tool in Valgrind, my program spends 57% of its total
> execution time in 'mark_slice' and 21% in 'sweep_slice'. Both figures
> are the 'Self Cost' (the cost of the function itself) rather than the
> 'Inclusive Cost' (which includes all called functions). I guess
> 'mark_slice' and 'sweep_slice' are functions from the OCaml GC. Are
> these numbers normal?

Those numbers sound rather high to me.

> My program mixes OCaml and C and passes C data types between the two.
> I also wonder whether I defined the interface in an inefficient way
> that slows down the GC. Are there any rules to keep in mind to make
> the GC work more efficiently?

You can tune some of the GC parameters to suit your use case.

Do you allocate custom types from C? In caml_alloc_custom(ops, size,
used, max) the used and max arguments influence how often the GC runs.
If you set them wrong, you might trigger the GC too often.

MfG
        Goswin



* Re: [Caml-list] Average cost of the OCaml GC
  2010-11-11  9:08 ` [Caml-list] " Goswin von Brederlow
@ 2010-11-11 13:52   ` Jianzhou Zhao
  2010-11-11 14:14     ` Michael Ekstrand
  2010-11-11 20:11     ` Goswin von Brederlow
  0 siblings, 2 replies; 8+ messages in thread
From: Jianzhou Zhao @ 2010-11-11 13:52 UTC
  To: Goswin von Brederlow; +Cc: caml-list

On Thu, Nov 11, 2010 at 4:08 AM, Goswin von Brederlow <goswin-v-b@web.de> wrote:
> You can tune some of the GC parameters to suit your use case.
>
> Do you allocate custom types from C? In caml_alloc_custom(ops, size,
> used, max) the used and max arguments influence how often the GC runs.

Yes. The code frequently uses caml_alloc_custom to create a lot of small
objects (less than 8 bytes). The used and max are set to the defaults,
0 and 1. The manual says
  http://caml.inria.fr/pub/docs/manual-ocaml/manual032.html#toc140

/////////////////////
If your finalized blocks contain no pointers to out-of-heap resources,
or if the previous discussion made little sense to you, just take used
= 0 and max = 1. But if you later find that the finalization functions
are not called “often enough”, consider increasing the used / max
ratio.
//////////////////////

Does this mean the default used and max make the GC finalize 'as
slowly as possible'? That does not seem to be the case if the 57%
and 21% costs are too high.

> If you set them wrong you might trigger the GC too often.

In which cases would they be set 'wrong'? For example, when 'used' does
not equal the real amount of allocated data? Or is there a valid range
for 'max' given a 'used'?

-- 
Jianzhou



* Re: [Caml-list] Average cost of the OCaml GC
  2010-11-11 13:52   ` Jianzhou Zhao
@ 2010-11-11 14:14     ` Michael Ekstrand
  2010-11-11 20:11     ` Goswin von Brederlow
  1 sibling, 0 replies; 8+ messages in thread
From: Michael Ekstrand @ 2010-11-11 14:14 UTC
  To: caml-list

On 11/11/2010 07:52 AM, Jianzhou Zhao wrote:
> On Thu, Nov 11, 2010 at 4:08 AM, Goswin von Brederlow <goswin-v-b@web.de> wrote:
>> Those numbers sound rather high to me.

They sound high to me as well, but not unheard of - I sometimes measure
a lot of time in the GC.

>> Do you allocate custom types from C? In caml_alloc_custom(ops, size,
>> used, max) the used and max arguments influence how often the GC runs.
> 
> Yes. The code frequently uses caml_alloc_custom to create a lot of small
> objects (less than 8 bytes). The used and max are set to the defaults,
> 0 and 1. The manual says
>   http://caml.inria.fr/pub/docs/manual-ocaml/manual032.html#toc140
> 
> /////////////////////
> If your finalized blocks contain no pointers to out-of-heap resources,
> or if the previous discussion made little sense to you, just take used
> = 0 and max = 1. But if you later find that the finalization functions
> are not called “often enough”, consider increasing the used / max
> ratio.
> //////////////////////
> 
> Does this mean the default used and max make the GC finalize 'as
> slowly as possible'? That does not seem to be the case if the 57%
> and 21% costs are too high.

Yes, with respect to GC cycles triggered by "too much" custom data
allocation.

There are a variety of things that can cause GC thrashing.  One of them
is the GC "space overhead" parameter, which controls how aggressive the
GC is at reclaiming memory.  Another is your minor heap size - if your
minor heap is too small, it can cause excess GC activity.  I documented
the parameter tuning I have done to reduce GC cost on my blog[1], but
here's a short summary:

 * Increase minor heap size.  I usually use 1M or 4M words; my general
rule of thumb is that I want one "work unit" with its temporary storage
requirements to fit in a minor heap.  This decreases the frequency both
of minor collections and major slices.
 * Increase space_overhead; I increase this to 100 or 200 (the default
is 80), as I typically run my large codes on machines with lots of spare
RAM and can accept a space-speed tradeoff.
 * Increase the heap increment.  If your process will require lots of
RAM, this lets it allocate that memory in bigger chunks, further
decreasing the memory overhead.
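
The three adjustments above can be sketched with the standard Gc module.
This is only a minimal sketch, not a recommendation: the helper name
tune_gc and the concrete values (a 1M-word minor heap, a space_overhead
of 200, a 4M-word heap increment) are assumptions picked from the ranges
mentioned above, and newer runtimes may ignore major_heap_increment.

```ocaml
(* Hypothetical helper: apply the three settings discussed above.
   All sizes are in words, as the Gc module expects. *)
let tune_gc ?(minor_words = 1 lsl 20) ?(space_overhead = 200)
    ?(heap_increment_words = 4 lsl 20) () =
  let c = Gc.get () in
  Gc.set
    { c with
      Gc.minor_heap_size = minor_words;
      Gc.space_overhead = space_overhead;
      Gc.major_heap_increment = heap_increment_words }

let () =
  tune_gc ();
  (* Read the parameters back to see what the runtime accepted. *)
  let c = Gc.get () in
  Printf.printf "minor_heap_size=%d words, space_overhead=%d\n"
    c.Gc.minor_heap_size c.Gc.space_overhead
```

The same parameters can usually be set without recompiling through the
environment, e.g. OCAMLRUNPARAM='s=1M,o=200,i=4M'.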

I also use a patched Bigarray that allows me to set the "max" parameter
it uses in its invocations of caml_alloc_custom, but if you are not
using bigarray that shouldn't be impacting your program's performance.
It's quite critical when allocating large bigarrays, though!  Having
custom blocks allocated near or above the "max" param is a sure-fire
recipe for GC thrashing.  It sounds like you're avoiding that pitfall,
though.

So, the short short story: you're doing many of the right things
(measuring, not letting custom allocations thrash the GC).  Some more
parameter tuning will hopefully help you decrease your GC overhead.

1. http://elehack.net/michael/blog/2010/06/ocaml-memory-tuning

- Michael



* Re: [Caml-list] Average cost of the OCaml GC
  2010-11-11 13:52   ` Jianzhou Zhao
  2010-11-11 14:14     ` Michael Ekstrand
@ 2010-11-11 20:11     ` Goswin von Brederlow
  2010-11-12 17:27       ` Jianzhou Zhao
  1 sibling, 1 reply; 8+ messages in thread
From: Goswin von Brederlow @ 2010-11-11 20:11 UTC
  To: Jianzhou Zhao; +Cc: Goswin von Brederlow, caml-list

Jianzhou Zhao <jianzhou@seas.upenn.edu> writes:

> Does this mean the default used and max make the GC finalize 'as
> slowly as possible'? That does not seem to be the case if the 57%
> and 21% costs are too high.

I think 0/1 gives you the fewest GC runs.

>> If you set them wrong you might trigger the GC too often.
>
> In which cases would they be set 'wrong'? For example, when 'used' does
> not equal the real amount of allocated data? Or is there a valid range
> for 'max' given a 'used'?

A used = 1000000 would be wrong here. Your 0/1 setting looks fine to me.
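
To illustrate why the ratio matters, here is a toy model in OCaml. It is
NOT the runtime's code: it simply assumes that every custom allocation
adds used/max to a running total and that the GC is pushed once the
total reaches 1.0. The function name allocations_per_push and the
clamping of used to max are assumptions made for this sketch.

```ocaml
(* Toy model, not the runtime's implementation: how many custom-block
   allocations it takes for the accumulated used/max ratio to reach 1.0.
   None means the ratio never accumulates (used = 0). *)
let allocations_per_push ~used ~max =
  if used <= 0 then None
  else begin
    let max = if max <= 0 then 1 else max in
    (* Assumption: a block cannot count for more than the whole budget. *)
    let used = if used > max then max else used in
    Some ((max + used - 1) / used) (* ceiling of max / used *)
  end

let () =
  let show (used, max) =
    match allocations_per_push ~used ~max with
    | None ->
      Printf.printf "used=%d max=%d: no extra GC pressure\n" used max
    | Some n ->
      Printf.printf "used=%d max=%d: pressure every %d allocation(s)\n"
        used max n
  in
  List.iter show [ (0, 1); (1, 1000); (1_000_000, 1) ]
```

Under this model the 0/1 setting never adds pressure, while a used of
1000000 with max = 1 would push the GC on every single allocation.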

MfG
        Goswin



* Re: [Caml-list] Average cost of the OCaml GC
  2010-11-11 20:11     ` Goswin von Brederlow
@ 2010-11-12 17:27       ` Jianzhou Zhao
  2010-11-12 21:54         ` ygrek
  2010-11-16 10:02         ` Goswin von Brederlow
  0 siblings, 2 replies; 8+ messages in thread
From: Jianzhou Zhao @ 2010-11-12 17:27 UTC
  To: Goswin von Brederlow; +Cc: caml-list

On Thu, Nov 11, 2010 at 3:11 PM, Goswin von Brederlow <goswin-v-b@web.de> wrote:
> A used = 1000000 would be wrong here. Your 0/1 setting looks fine to me.

Are there other ways to debug such problems? Is it possible to know
when and where the GC runs, say, the number of times the GC runs
after a particular user-defined function? If so, I wonder whether we
could see which function in my code misbehaves.

-- 
Jianzhou



* Re: [Caml-list] Average cost of the OCaml GC
  2010-11-12 17:27       ` Jianzhou Zhao
@ 2010-11-12 21:54         ` ygrek
  2010-11-16 10:02         ` Goswin von Brederlow
  1 sibling, 0 replies; 8+ messages in thread
From: ygrek @ 2010-11-12 21:54 UTC
  To: caml-list

On Fri, 12 Nov 2010 12:27:40 -0500
Jianzhou Zhao <jianzhou@seas.upenn.edu> wrote:

> Are there other ways to debug such problems? Is it possible to know
> when and where the GC runs, say, the number of times the GC runs
> after a particular user-defined function? If so, I wonder whether we
> could see which function in my code misbehaves.

Below is straightforward "GC diffing" code that helps me pinpoint excessive GC
activity (like that of ExtLib.String.nsplit in the example).

$ cat a.ml 

open Printf
open Gc

let bytes_string_f f = (* oh ugly *)
  let a = abs_float f in
  if a < 1024. then sprintf "%dB" (int_of_float f) else
  if a < 1024. *. 1024. then sprintf "%dKB" (int_of_float (f /. 1024.)) else
  if a < 1024. *. 1024. *. 1024. then sprintf "%.1fMB" (f /. 1024. /. 1024.) else
  sprintf "%.1fGB" (f /. 1024. /. 1024. /. 1024.)

let bytes_string x = bytes_string_f (float_of_int x)

let caml_words_f f =
  bytes_string_f (f *. (float_of_int (Sys.word_size / 8)))

let caml_words x = caml_words_f (float_of_int x)

(* Difference between two Gc.stat snapshots: words allocated, heap growth,
   and the number of compactions, major and minor collections (in that order). *)
let gc_diff st1 st2 =
  let allocated st = st.minor_words +. st.major_words -. st.promoted_words in
  let a = allocated st2 -. allocated st1 in
  let minor = st2.minor_collections - st1.minor_collections in
  let major = st2.major_collections - st1.major_collections in
  let compact = st2.compactions - st1.compactions in
  let heap = st2.heap_words - st1.heap_words in
  sprintf "allocated %10s, heap %10s, collection %d %d %d" (caml_words_f a) (caml_words heap) compact major minor

let gc_show name f x =
  let st = Gc.quick_stat () in
  Std.finally (fun () -> let st2 = Gc.quick_stat () in 
    eprintf "GC DIFF %s : %s\n" name (gc_diff st st2)) f x

let () =
  let _ = gc_show "split" (ExtLib.String.nsplit (String.make 10000 'a')) "a" in
  gc_show "compact" Gc.compact ()

$ ocamlfind ocamlopt -linkpkg -package extlib a.ml -o a
$ ./a 
GC DIFF split : allocated     48.1MB, heap     48.0MB, collection 0 21 373
GC DIFF compact : allocated       240B, heap    -48.0MB, collection 1 2 0
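
For readers without ExtLib, the same idea can be sketched with the
standard library alone. This is a simplified variant under stated
assumptions, not the code above: the record gc_delta and the function
measure are hypothetical names, the result is returned instead of
printed, and nothing is reported if f raises.

```ocaml
(* Hypothetical record holding the GC activity caused by one call. *)
type gc_delta = {
  allocated_words : float;
  minor : int;
  major : int;
}

(* Words allocated so far, counting promoted words only once. *)
let allocated (st : Gc.stat) =
  st.Gc.minor_words +. st.Gc.major_words -. st.Gc.promoted_words

(* Run [f x] and return its result together with the GC activity
   observed between two Gc.quick_stat snapshots. *)
let measure f x =
  let s1 = Gc.quick_stat () in
  let result = f x in
  let s2 = Gc.quick_stat () in
  ( result,
    { allocated_words = allocated s2 -. allocated s1;
      minor = s2.Gc.minor_collections - s1.Gc.minor_collections;
      major = s2.Gc.major_collections - s1.Gc.major_collections } )

let () =
  let l, d = measure (fun n -> List.init n (fun i -> i)) 200_000 in
  Printf.eprintf "built %d cells: %.0f words, %d minor, %d major\n"
    (List.length l) d.allocated_words d.minor d.major
```

Wrapping individual suspect functions with measure gives a rough
per-function picture of allocation and collection counts.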

-- 
 ygrek
 http://ygrek.org.ua



* Re: [Caml-list] Average cost of the OCaml GC
  2010-11-12 17:27       ` Jianzhou Zhao
  2010-11-12 21:54         ` ygrek
@ 2010-11-16 10:02         ` Goswin von Brederlow
  1 sibling, 0 replies; 8+ messages in thread
From: Goswin von Brederlow @ 2010-11-16 10:02 UTC
  To: Jianzhou Zhao; +Cc: caml-list

Jianzhou Zhao <jianzhou@seas.upenn.edu> writes:

> Are there other ways to debug such problems? Is it possible to know
> when and where the GC runs, say, the number of times the GC runs
> after a particular user-defined function? If so, I wonder whether we
> could see which function in my code misbehaves.

Only through the interface the Gc module exposes. You can make the GC
quite verbose.
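
A minimal sketch of what "verbose" can look like in practice, assuming a
compiler recent enough to provide Fun.protect. The wrapper name
with_gc_trace is hypothetical; the verbose bits are the ones documented
for the Gc module (0x01 for major cycles, 0x02 for minor collections and
major slices). GC messages go to standard error while f runs.

```ocaml
(* Hypothetical wrapper: turn on GC tracing around one call and restore
   the previous GC settings afterwards, even if [f] raises. *)
let with_gc_trace f x =
  let saved = Gc.get () in
  (* 0x01: major GC cycles; 0x02: minor collections and major slices *)
  Gc.set { saved with Gc.verbose = 0x01 lor 0x02 };
  Fun.protect ~finally:(fun () -> Gc.set saved) (fun () -> f x)

let () =
  let n =
    with_gc_trace
      (fun k -> List.length (List.init k string_of_int))
      200_000
  in
  Printf.printf "built %d strings\n" n
```

For a whole-program trace without code changes, the same bits can be
passed through the environment, e.g. OCAMLRUNPARAM='v=0x03'; adding
0x400 asks the runtime to print GC statistics at exit.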

MfG
        Goswin



Thread overview: 8+ messages
2010-11-11  3:59 Average cost of the OCaml GC Jianzhou Zhao
2010-11-11  9:08 ` [Caml-list] " Goswin von Brederlow
2010-11-11 13:52   ` Jianzhou Zhao
2010-11-11 14:14     ` Michael Ekstrand
2010-11-11 20:11     ` Goswin von Brederlow
2010-11-12 17:27       ` Jianzhou Zhao
2010-11-12 21:54         ` ygrek
2010-11-16 10:02         ` Goswin von Brederlow
