From: Hugo Ferreira
Date: Sat, 26 Mar 2011 09:11:37 +0000
To: Gerd Stolpmann
CC: Martin Jambon, caml-list@inria.fr
Subject: Re: [Caml-list] Efficient OCaml multicore -- roadmap?

On 03/25/2011 08:26 PM, Gerd Stolpmann wrote:
> On Friday, 25.03.2011, at 19:19 +0000, Hugo Ferreira wrote:
>> On 03/25/2011 06:24 PM, Martin Jambon wrote:
>>>> On 03/25/2011 01:10 PM, Fabrice Le Fessant wrote:
>>>>> Of course, sharing structured mutable data between threads will not be
>>>>> possible, but actually, it is a good thing if you want to write correct
>>>>> programs ;-)
>>>
>>> On 03/25/11 08:44, Hugo Ferreira replied:
>>>> I'll stick to my guns here. It simply makes solving certain problems
>>>> infeasible. Case in point: I work on machine learning algorithms. I
>>>> use large data structures that must be processed (altered) in order
>>>> to learn. Because these data structures are large, it becomes
>>>> impractical to copy them to a process every time I start off a new
>>>> "thread".
>>>
>>> The solution would be to use get/set via a message-passing interface.
>>
>> I cannot see how this works. Say I want to share a balanced binary tree.
>> Several processes/threads each take this tree and alter it by adding and
>> deleting elements. Each (new) tree is then further processed by other
>> processes/threads.
>>
>> How can get/set be used in this scenario?
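For concreteness, here is one shape such a get/set interface could take: a
minimal, speculative sketch (not a proposed API; all names are invented for
the example) in which a single "server" thread owns the current tree and
clients reach it only through messages. It uses the stdlib threads library
(Thread, Event) and Set, whose implementation happens to be a balanced
binary tree. Within one process the tree is immutable, so a Get hands the
client a pointer, not a copy; across processes the same protocol would
marshal the value.

  (* build: ocamlfind ocamlopt -package threads.posix -linkpkg tree_server.ml *)

  module IntSet = Set.Make (struct type t = int let compare = compare end)

  type request =
    | Get of IntSet.t Event.channel        (* carries a reply channel *)
    | Update of (IntSet.t -> IntSet.t)     (* function applied to the tree *)

  let requests : request Event.channel = Event.new_channel ()

  (* Only the server ever holds the master tree, so the tree itself
     needs no locking. *)
  let rec server tree =
    match Event.sync (Event.receive requests) with
    | Get reply ->
        Event.sync (Event.send reply tree);   (* shares a pointer, no copy *)
        server tree
    | Update f ->
        server (f tree)                       (* old version stays valid *)

  let get () =
    let reply = Event.new_channel () in
    Event.sync (Event.send requests (Get reply));
    Event.sync (Event.receive reply)

  let update f = Event.sync (Event.send requests (Update f))

  let () =
    ignore (Thread.create server IntSet.empty);
    update (IntSet.add 42);
    update (IntSet.remove 17);
    Printf.printf "size = %d\n" (IntSet.cardinal (get ()))

Workers that "take this tree and alter it" would call get once, build their
modified version locally, and publish it with update; versions never clash,
because only the server installs them.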
>>> From my purely speculative perspective, it seems unavoidable that
>>> message passing happens at some level in order to keep a shared data
>>> structure in sync between a large number of processors. In other
>>> words, any access to a shared data structure requires some physical
>>> copy, no matter what the programming language makes it look like.
>>
>> I assume you are referring to multi-processing, where memory is not
>> shared amongst CPUs, correct?
>
> This is quite normal nowadays when you have several CPUs in a (server)
> system. Each CPU gets its own cache, or even its own bank of RAM (e.g.
> all Opterons have this). And there is indeed message passing at the
> hardware level (HyperTransport, QuickPath, or InfiniBand). To the
> software, memory still looks uniform (cache coherency), but under the
> hood messages are exchanged to produce this effect (or RAM accesses
> are even routed over the CPU interconnect).
>
> From software, the messages are noticeable only as a speed degradation
> when you access RAM in the wrong way. E.g. you can see this when you
> read/modify/write the same memory cell in a loop from two threads
> running on two cores. This is a lot slower than if only a single core
> did it, because the two cores constantly exchange messages and have to
> wait until each message is delivered (this is also known as cache line
> bouncing).
>
> When you implement a message-passing API for a high-level language,
> the way to go is to provide message buffers in memory. When a thread
> delivers a message, it just writes to the buffer. The real message,
> however, is sent by the hardware under the hood, and only has to be
> delivered by the time the reading thread synchronizes. The point here,
> IMHO, is that you exploit the hardware most when you do things the way
> it deals with best. Thus a program written against the message-passing
> API will in the end be faster than one that uncritically assumes
> uniform memory, modifies it in place, and suffers from cache line
> bouncing.
>
> For current standard server hardware there are usually not enough CPUs
> for the effect to be big. But in a few years it will be very important
> (as it already is for current supercomputers), maybe even for standard
> 128-core laptops in 2015 (just guessing).

Thanks for the info.

Hugo

> Gerd
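Gerd's read/modify/write experiment can be written down, though not in the
OCaml of 2011, which had no parallel threads: the sketch below assumes the
Domain and Atomic modules that a shared-memory OCaml would need (OCaml 5
eventually shipped exactly these). Both runs perform the same total number
of increments; the two-domain run on a shared cell is typically far slower,
and the difference is the cache line bouncing.

  (* build: ocamlfind ocamlopt -package unix -linkpkg bounce.ml
     (requires OCaml 5 for Domain) *)

  let iterations = 50_000_000

  let hammer cell =
    for _ = 1 to iterations do
      Atomic.incr cell          (* read/modify/write of one memory cell *)
    done

  let time label f =
    let t0 = Unix.gettimeofday () in
    f ();
    Printf.printf "%-20s %.2fs\n" label (Unix.gettimeofday () -. t0)

  let () =
    (* One core does all 2*N increments: the cache line stays put. *)
    let solo = Atomic.make 0 in
    time "one core, 2N incrs" (fun () -> hammer solo; hammer solo);
    (* Two cores share the cell: N increments each, but the line
       bounces between their caches on nearly every access. *)
    let shared = Atomic.make 0 in
    time "two cores, N each" (fun () ->
      let d = Domain.spawn (fun () -> hammer shared) in
      hammer shared;
      Domain.join d)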
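The "message buffers in memory" Gerd recommends need nothing exotic. Here
is a sketch of one such buffer, built only from the stdlib's Mutex,
Condition and Queue (the type and function names are invented for the
example). send merely writes into the buffer and returns; whatever traffic
the hardware generates to move the bytes happens under the hood, and only
has to be complete by the time recv returns on the other side.

  type 'a mailbox = {
    buf : 'a Queue.t;
    lock : Mutex.t;
    nonempty : Condition.t;
  }

  let create () =
    { buf = Queue.create ();
      lock = Mutex.create ();
      nonempty = Condition.create () }

  let send mbox msg =
    Mutex.lock mbox.lock;
    Queue.push msg mbox.buf;         (* delivery = a write into the buffer *)
    Condition.signal mbox.nonempty;  (* wake one waiting reader, if any *)
    Mutex.unlock mbox.lock

  let recv mbox =
    Mutex.lock mbox.lock;
    while Queue.is_empty mbox.buf do
      Condition.wait mbox.nonempty mbox.lock  (* sleeps with the lock released *)
    done;
    let msg = Queue.pop mbox.buf in
    Mutex.unlock mbox.lock;
    msg

A sender touches only the queue and its own message, so contention is
confined to the short critical section rather than the application's data
structures, which is the behaviour Gerd argues the hardware rewards.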