Subject: Re: [Caml-list] OC4MC : OCaml for Multicore architectures
From: Philippe Wang
To: Pascal Cuoq
Cc: caml-list@yquem.inria.fr
Date: Thu, 24 Sep 2009 18:30:56 +0200

On Sep 24, 2009, at 18:02 GMT+02:00, Pascal Cuoq wrote:

> On Sep 24, 2009, at 5:47 PM, Philippe Wang wrote:
>
>>> Is the copy operation parallelized?
>>
>> Nope. When the world is stopped for the collection, everything is
>> done sequentially until the world is resumed.
>> I don't think it's relevant to parallelize the copy operation (hell
>> to implement & debug, and I don't think the performance would be
>> very interesting because we would probably need a write mutex on
>> the destination heap).
>
> Well, you could start copying to the bottom of the next heap with
> one thread going up and to the top of it with another going down.
> Assume optimistically that the two threads will not reach the same
> cacheline at the end of the copies, and you don't need any
> synchronisation at all between them, except joining at the end.
>
> After checking, if they have reached the same cacheline,
> you need to reallocate the destination heap anyway.
>
> You still get a single unfragmented free block as a result.
>
> Even better: stop the world just before there remains less than one
> cacheline of free space and you don't need to check if the two
> threads have met. You still need to reallocate the destination heap
> sometimes though.

A concurrent copy would mean significant overhead on a single core. It
would also make memory bandwidth the bottleneck: memory copy operations
are quickly limited by that bandwidth, not by the CPU.
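(To make the bandwidth point concrete, here is a rough toy
micro-benchmark, not OC4MC code and with arbitrary buffer sizes: N
threads each repeatedly memcpy a large private buffer, and the
aggregate MB/s typically stops scaling well before the number of cores
does, because the memory bus saturates.)

/* Toy micro-benchmark (not OC4MC code): each thread repeatedly copies
   a large private buffer; compare the aggregate MB/s for 1, 2, 4, ...
   threads to see where the memory bus becomes the limit. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BUF_BYTES (64 * 1024 * 1024)   /* 64 MB per thread (arbitrary) */
#define REPEAT    8

static void *copy_worker(void *arg)
{
  (void)arg;
  char *src = malloc(BUF_BYTES), *dst = malloc(BUF_BYTES);
  memset(src, 1, BUF_BYTES);
  for (int i = 0; i < REPEAT; i++)
    memcpy(dst, src, BUF_BYTES);       /* pure copy work, no computation */
  free(src);
  free(dst);
  return NULL;
}

int main(int argc, char **argv)
{
  int n = (argc > 1) ? atoi(argv[1]) : 1;   /* number of copying threads */
  pthread_t *tids = malloc(n * sizeof *tids);
  struct timespec t0, t1;
  clock_gettime(CLOCK_MONOTONIC, &t0);
  for (int i = 0; i < n; i++)
    pthread_create(&tids[i], NULL, copy_worker, NULL);
  for (int i = 0; i < n; i++)
    pthread_join(tids[i], NULL);
  clock_gettime(CLOCK_MONOTONIC, &t1);
  double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
  double mb   = (double)n * REPEAT * BUF_BYTES / (1024.0 * 1024.0);
  printf("%d thread(s): %.0f MB in %.2f s = %.0f MB/s\n", n, mb, secs, mb / secs);
  free(tids);
  return 0;
}

(Compile with something like "cc -O2 -pthread membw.c"; the file name
is of course arbitrary.)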
That may hopefully change in a few years, but hardware manufacturers
don't seem very interested in it; they seem to prefer doing their
marketing on the number of cores. Look at GPUs: they have very fast
graphics RAM, but also a huge number of processing units. I don't
really see the point of that (i.e. having such a huge number of PUs)
anyway, except "marketing".

Ok, back to GC stuff.

A stop&copy algorithm needs a set of roots from which to copy the live
values. Each thread has its own stack, so it has its own subset of
roots. Then what? Parallelize the copy from each thread? Then we have
to determine the best number of copying threads according to the
number of cores, but more importantly according to the memory
bandwidth available per core. (What a nightmare!)

Then there are shared values (in the shared heap for instance, but
what if there are lateral pointers due to mutable values?). (We are
leaving the nightmare for hell! But some people have been there.)

Copying a live value means that if you later encounter something
pointing to its old address, you have to know the new one. This means
writing the forwarding address at the old address. (There is a small
sketch of what I mean at the end of this mail.)

So I don't see how we could do something very interesting *today* with
a concurrent copy in a stop&copy algorithm. I believe (but I'm *not* a
GC expert at all) that concurrent GCs are not based on a stop&copy
algorithm but rather on some mark&{do-some-stuff-such-as-"sweep"}
scheme.

> Oh, and I meant to say, but everyone else was faster than me:
> well done!

Thank you, and thanks to everyone else who appreciates this work. :-)

Philippe Wang
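P.S. To illustrate the "writing at the old address" point, here is a
tiny, purely illustrative sketch of forwarding addresses in a
stop&copy evacuation (hypothetical layout and names, not the actual
OC4MC runtime):

/* Sketch of the "write the new address at the old address" trick used
   by stop&copy evacuation (hypothetical layout, not OC4MC code). */
#include <stdio.h>
#include <string.h>

typedef struct block {
  size_t size;              /* payload size, in words */
  struct block *forward;    /* NULL, or new address once evacuated */
  long payload[2];          /* toy payload */
} block;

static block to_space[8];            /* toy destination heap */
static block *to_alloc = to_space;   /* bump-pointer allocator into it */

/* Copy a live block into the destination heap, or return the address
   of its copy if it has already been moved. */
static block *evacuate(block *old)
{
  if (old->forward != NULL)          /* seen before: reuse the new copy */
    return old->forward;
  block *copy = to_alloc++;          /* bump-allocate in the destination */
  memcpy(copy, old, sizeof *copy);   /* move header + payload */
  copy->forward = NULL;
  old->forward = copy;               /* the new address, left at the old one */
  return copy;
}

int main(void)
{
  block b = { 2, NULL, { 42, 43 } };
  block *first  = evacuate(&b);      /* copies b into to_space */
  block *second = evacuate(&b);      /* follows the forwarding address */
  printf("same copy: %s, payload: %ld\n",
         first == second ? "yes" : "no", first->payload[0]);
  return 0;
}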