caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed
* [Caml-list] poor man's allocation profiler
@ 2011-08-07 12:07 ygrek
  0 siblings, 0 replies; only message in thread
From: ygrek @ 2011-08-07 12:07 UTC (permalink / raw)
  To: caml-list

Hello,

  I would like to share a simple trick that can be used to find
allocation hot spots in ocaml programs. The idea is the same as
in "poor man's profiler"[1] (by the way it is perfectly applicable
and useful for profiling ocaml programs, see also pmp script[2]),
and shares its advantages:
 - no recompilation needed
 - simple, understandable and universally available
and shortcomings:
 - launching gdb many times is rather slow
 - sampling influences performance of target process

  We know that caml_call_gc is usually called from allocation functions
(when minor heap is exhausted). Putting breakpoint at caml_call_gc
and recording stack traces when breakpoint is hit, then grouping identical
traces after long run will show the functions that call gc most often,
hence do most allocations!

  Unfortunately gdb cannot properly unwind ocaml stack traces because
ocamlopt doesn't employ frame pointer and doesn't provide static stack
unwind information. As a result, on x86_64 stacks are shown in full,
but with lots of garbage, on 32-bit - usually truncated after a couple of
frames. But fear not :) - there exists a patch for ocamlopt to 
produce stack unwind information[3] and after recompiling one can enjoy
full stack traces in gdb! Patch doesn't break binary compatibility and so
there is no need to recompile all used libraries (but that is advisable to
get maximum coverage).

  Finally, the spell :

for i in $(echo {1..100}); do\
 gdb -p PID -batch -n -q -ex 'b caml_call_gc' -ex 'c' -ex 'bt'\
 |awk '/^#[^0]/ {print $4}' |uniq |paste -s -d , ;\
done |sort |uniq -c |sort -nr

  Let's test this method on some large and hairy ocaml codebase, e.g. mldonkey[4] :)
  (Output is truncated manually for readability)

  Original : 

     41 camlList__rev_append_1051,camlList2__cut_rec_1071,camlCommonFile__fun_2610,camlWeak__iter_bucket_1117,camlArray__iter_1101 ...
     33 camlList2__cut_rec_1071,camlCommonFile__fun_2610,camlWeak__iter_bucket_1117,camlArray__iter_1101,camlBasicSocket__exec_timers_1547 ...
      6 camlBasicSocket__iter_task_1504,camlBasicSocket__loop_1550,camlCommonMain__entry,caml_program,caml_start_program,caml_main,main
      2 camlBasicSocket__iter_timer_1509,camlBasicSocket__loop_1550,camlCommonMain__entry,caml_program,caml_start_program,caml_main,main
      2 camlBasicSocket__f_1523,camlBasicSocket__exec_timers_1547,camlBasicSocket__loop_1550 ...
      [...]

  Ouch, 75% of allocations happen in single place, after some investigation
it appears to be really inefficient implementation of sliding window used
in a tight loop. After switching to Queue we get :

     30 camlBasicSocket__iter_task_1504,camlBasicSocket__loop_1550,camlCommonMain__entry,caml_program,caml_start_program,caml_main,main
     10 camlBasicSocket__iter_timer_1509,camlBasicSocket__loop_1550,camlCommonMain__entry,caml_program,caml_start_program,caml_main,main
      7 camlCommonSwarming__fun_4928,camlBitv__fun_1554,camlArray__iteri_1111,camlCommonSwarming__set_uploader_intervals_3030 ...
      6 camlCommonSwarming__intervals_to_string_3017,camlCommonSwarming__set_uploader_intervals_3030,camlBTClients__client_to_client_2653 ...
      4 camlPrintf__scan_format_1278,camlString2__unsplit_1082,camlFilename2__normalize_1033,camlCommonShared__new_shared_1681,camlCommonShared__fun_1932 ...
      [...]

  Fixing iter_timer is simple :

     23 camlBasicSocket__iter_task_1504,camlBasicSocket__loop_1558,camlCommonMain__entry,caml_program,caml_start_program,caml_main,main
      7 camlBasicSocket__loop_1518,camlBasicSocket__loop_1558,camlCommonMain__entry,caml_program,caml_start_program,caml_main,main
      4 camlPrintf__scan_format_1278,camlString2__unsplit_1082,camlFilename2__normalize_1033,camlCommonShared__new_shared_1681 ...
      2 camlPrintf__scan_format_1278,camlString2__unsplit_1082,camlFilename2__normalize_1033,camlCommonComplexOptions__incoming_dir_1983 ...
      2 camlPrintf__scan_format_1278,camlList__iter_1074,camlBencode__encode_1085,camlList__iter_1074,camlBencode__encode_1085 ...
      [...]

  The only remaining big hot spot is iter_task, but it requires larger
changes, so I'll leave it out. The majority of allocations now are spread
accross whole program. As a result - allocation rate reduced by 25%.

[1]: http://poormansprofiler.org/
]2]: http://aspersa.googlecode.com/svn/trunk/pmp
[3]: http://caml.inria.fr/mantis/view.php?id=5314
[4]: http://mldonkey.sf.net/

Hope somebody finds it useful :)

-- 
 ygrek
 http://ygrek.org.ua

^ permalink raw reply	[flat|nested] only message in thread

only message in thread, other threads:[~2011-08-07 12:08 UTC | newest]

Thread overview: (only message) (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-08-07 12:07 [Caml-list] poor man's allocation profiler ygrek

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).