From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Thu, 2 Dec 2010 17:07:59 +0200
From: Török Edwin
To: Damien Doligez
Cc: OCaml mailing list
Subject: Re: [Caml-list] OCaml GC [was Is OCaml fast?]
Message-ID: <20101202170759.1c9feec6@deb0>
In-Reply-To: <0EC166A9-4CBE-48D2-89CD-E50CD8191E11@inria.fr>
References: <1290434674.16005.354.camel@thinkpad> <582306206.731582.1290438133628.JavaMail.root@zmbs4.inria.fr> <20101122203334.7adc5ee6@deb0> <20101127221121.0920db65@ordinaves.concept-micro.com> <4CF25878.4060202@univ-savoie.fr> <916556265.243293.1291069665032.JavaMail.root@zmbs1.inria.fr> <0EC166A9-4CBE-48D2-89CD-E50CD8191E11@inria.fr>

On Thu, 2 Dec 2010 13:57:23 +0100 Damien Doligez wrote:

> On 2010-11-29, at 23:27, Török Edwin wrote:
>
> > This seems to be in concordance with the "smaller minor heap => more
> > minor collections => slower program" observation, since it is:
> > smaller minor heap => more minor collections => more major slices =>
> > major slices can't collect long-lived objects => slower program
> > (long-lived objects are swept too many times).
> > So more minor collections => higher cost for a major slice.
>
> This is a bit naive, because the size of a major slice depends on the
> amount of data promoted by the corresponding minor GC.  A smaller
> minor heap promotes less data each time, so you get more major
> slices, but they are smaller.  The total amount of work done by the
> major GC doesn't change because of that.  It only changes because a
> bigger minor heap gives more time for data to die before being
> promoted to the major heap.

Thanks for the explanation.

Is there a way I could use more than 2 bits for the color?  I thought of
trying to exclude some objects from collection if they failed to get
collected for a few cycles (i.e. just mark them black from the start).
I tried modifying Wosize_hd and Make_header, but that is not enough;
is there anything else I missed?

> > I think OCaml's GC should take into account how successful the last
> > major GC was (how many words it freed), and adjust its speed
> > accordingly: if we know that the major slice will collect next to
> > nothing, there is no point in wasting time running it.
> >
> > So this formula should probably be changed:
> > p = (double) caml_allocated_words * 3.0 * (100 + caml_percent_free)
> >     / Wsize_bsize (caml_stat_heap_size) / caml_percent_free / 2.0;
> >
> > Probably to something that also does:
> > p = p * major_slice_successrate
>
> The success rate is already taken into account by the major GC.  In
> fact a target success rate is set by the user: it is
> caml_percent_free / (100 + caml_percent_free)

Thanks, that is probably what I had in mind, but adjusting it doesn't
work the way I initially expected.

> and the major GC adjusts its speed according to this target.  If the
> target is not met, the free list is emptied faster than the GC can
> replenish it, so it gets depleted at some point and the major heap
> is then extended to satisfy allocation requests.  The bigger major
> heap then helps the major GC meet its target, because the success rate
> is simply (heap_size - live_data) / heap_size, and that gets higher
> as heap_size increases.
>
> I didn't do the math, but I think your modification would make the
> major heap size increase without bound.

Yes, I have been experimenting, and unless I put a lower bound on 'p'
(such as half of the untuned value) it used far too much memory.  It was
also faster, but then not running the GC at all is faster too, and that
is not a good trade-off.  (A rough sketch of what I tried is included
below, before my signature.)

Which parameter do you think could be tuned automatically to accommodate
bursts of allocations?  Maybe major_heap_increment (based on the recent
history of heap expansions and promoted words)?

BTW, I was also looking for ways to speed up mark_slice and sweep_slice
(which is not easy, since they are mostly optimized already).  I did find
a place where GCC is not smart enough, though.  The attached patch
improves sweep_slice by 3%-10% in ocamlopt, depending on the CPU (3% on
Intel, 10% on an AMD Phenom II X6).  Should I open a bug and attach it?
(A simplified illustration of the issue is also included below.)

Using gcc -O3 would sometimes give the same benefit, but I don't really
trust -O3; -O2 is as far as I would go.  See here for all the details:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46763
[Speaking of which, why is the GC compiled with -O?  Is -O2 considered
too risky as well?]

P.S.: minor_heap_size got bumped in the 3.12 branch, so it is now greater
than major_heap_increment.  Is that intended?
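
Here is, roughly, the shape of the slice-size experiment mentioned above,
written as a standalone sketch rather than as a patch against major_gc.c.
The success_rate argument is hypothetical (something like "words actually
freed by the previous major slice divided by the words it was expected to
free"); the other arguments stand in for the globals used by the real
formula.

/* Standalone sketch of "p = p * major_slice_successrate" with a lower
   bound of half the untuned slice (without the bound, memory use blew up).
   heap_wsize is the major heap size in words, i.e. what
   Wsize_bsize (caml_stat_heap_size) computes. */
double slice_amount (double allocated_words, double heap_wsize,
                     double percent_free, double success_rate)
{
  double p = allocated_words * 3.0 * (100 + percent_free)
             / heap_wsize / percent_free / 2.0;
  double p_min = p / 2.0;      /* lower bound: half of the untuned slice */
  p *= success_rate;           /* do less work if the last slice freed little */
  return p < p_min ? p_min : p;
}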
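
And to illustrate the kind of transformation the attached patch does by
hand, here is a minimal example of my own (simplified; it is not code from
the patch or from the bugzilla entry).  Because the loop body calls an
external function that, as far as the compiler knows, might read or write
the global sweep pointer, the global has to be reloaded and stored on every
iteration.  Caching it in a local lets it stay in a register; the real
patch still writes the global back where the callees may need it (at the
start of the Caml_white case) and once when the loop exits.

/* Toy example: sweep_hp stands in for caml_gc_sweep_hp and merge_block
   for caml_fl_merge_block.  The second version assumes, for the sake of
   the example, that merge_block does not consult the global. */
extern char *sweep_hp;
extern char *merge_block (char *hp);

void sweep_reloading_the_global (char *limit)
{
  while (sweep_hp < limit)          /* global loaded and stored each iteration */
    sweep_hp = merge_block (sweep_hp);
}

void sweep_with_a_local_copy (char *limit)
{
  char *hp = sweep_hp;              /* cached: can stay in a register */
  while (hp < limit)
    hp = merge_block (hp);
  sweep_hp = hp;                    /* written back once */
}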
Best regards,
--Edwin

[Attachment: gc_opt.patch (text/x-patch)]

diff -ru /home/edwin/ocaml-3.12.0+rc1/byterun//major_gc.c ./major_gc.c
--- /home/edwin/ocaml-3.12.0+rc1/byterun//major_gc.c	2009-11-04 14:25:47.000000000 +0200
+++ ./major_gc.c	2010-12-02 17:02:38.000000000 +0200
@@ -286,21 +286,25 @@
 {
   char *hp;
   header_t hd;
+  /* speed opt: keep the global in a local variable */
+  char *gc_sweep_hp = caml_gc_sweep_hp;

   caml_gc_message (0x40, "Sweeping %ld words\n", work);
   while (work > 0){
-    if (caml_gc_sweep_hp < limit){
-      hp = caml_gc_sweep_hp;
+    if (gc_sweep_hp < limit){
+      hp = gc_sweep_hp;
       hd = Hd_hp (hp);
       work -= Whsize_hd (hd);
-      caml_gc_sweep_hp += Bhsize_hd (hd);
+      gc_sweep_hp += Bhsize_hd (hd);
+      PREFETCH_READ_NT(gc_sweep_hp);
       switch (Color_hd (hd)){
       case Caml_white:
+        caml_gc_sweep_hp = gc_sweep_hp;
         if (Tag_hd (hd) == Custom_tag){
           void (*final_fun)(value) = Custom_ops_val(Val_hp(hp))->finalize;
           if (final_fun != NULL) final_fun(Val_hp(hp));
         }
-        caml_gc_sweep_hp = caml_fl_merge_block (Bp_hp (hp));
+        gc_sweep_hp = caml_fl_merge_block (Bp_hp (hp));
         break;
       case Caml_blue:
         /* Only the blocks of the free-list are blue.  See [freelist.c]. */
@@ -311,7 +315,7 @@
         Hd_hp (hp) = Whitehd_hd (hd);
         break;
       }
-      Assert (caml_gc_sweep_hp <= limit);
+      Assert (gc_sweep_hp <= limit);
     }else{
       chunk = Chunk_next (chunk);
       if (chunk == NULL){
@@ -320,11 +324,12 @@
         work = 0;
         caml_gc_phase = Phase_idle;
       }else{
-        caml_gc_sweep_hp = chunk;
+        gc_sweep_hp = chunk;
         limit = chunk + Chunk_size (chunk);
       }
     }
   }
+  caml_gc_sweep_hp = gc_sweep_hp;
 }


diff -ru /home/edwin/ocaml-3.12.0+rc1/byterun//memory.h ./memory.h
--- /home/edwin/ocaml-3.12.0+rc1/byterun//memory.h	2008-12-03 20:09:09.000000000 +0200
+++ ./memory.h	2010-12-02 17:06:12.000000000 +0200
@@ -215,6 +215,16 @@
 #define CAMLunused
 #endif

+/* non-temporal prefetch for read (the data need not be left in the cache after the access) */
+#if defined (__GNUC__) && (__GNUC__ > 3 || (__GNUC__ == 3 && __GNUC_MINOR__ > 1))
+  #define PREFETCH_READ_NT(addr) __builtin_prefetch((addr), 0, 0)
+  #define PREFETCH_READ(addr) __builtin_prefetch((addr), 0, 3)
+#else
+  #define PREFETCH_READ_NT(addr)
+  #define PREFETCH_READ(addr)
+#endif
+
+
 #define CAMLxparam1(x) \
   struct caml__roots_block caml__roots_##x; \
   CAMLunused int caml__dummy_##x = ( \