Date: Sat, 26 Mar 2011 10:31:49 +0000
From: "Richard W.M. Jones"
To: Gerd Stolpmann
Cc: Hugo Ferreira, Martin Jambon, caml-list@inria.fr
Subject: Re: [Caml-list] Efficient OCaml multicore -- roadmap?

Obligatory pointer to Uli Drepper's series about what every programmer
should know about memory.  Part 1 is here; parts 2-9 are linked at the
end of the article, just before the comments:

  http://lwn.net/Articles/250967/

To expand on what I said before: the sort of high-end hardware we're
seeing now has 128 cores and hundreds of gigabytes, or even a terabyte,
of non-uniform RAM.  The cores are grouped into NUMA nodes, each node
having its own attached memory, southbridge and hardware (network
ports etc.) -- in effect separate computers interconnected by some
very fast and very specialized "network" channels that are invisible
to the programmer.

If you 'cross' a node boundary, e.g. by having a program or its data
located in the memory of one node while it runs on a core in another
node, then you suffer a penalty (10-40% directly, plus a lot of
hard-to-measure indirect costs from consuming channel resources
between nodes).  Obviously we try hard to schedule things and to pin
processes, virtual machines and so on so that this never happens.

Having said the above, even straight SMP isn't very uniform.  You've
still got to take into account caches and cache consistency (which at
the hardware level is message passing or bus snooping).

Rich.

-- 
Richard Jones
Red Hat
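
A minimal sketch of the pinning described above, assuming Linux with
libnuma installed (node 0 is an arbitrary choice): restrict the process
to one node's cores and its allocations to that node's memory, so code
and data stay local and the remote-access penalty never applies.

  /* numa_pin.c - pin this process to one NUMA node.
   * Assumes Linux + libnuma; build with: cc numa_pin.c -lnuma */
  #include <stdio.h>
  #include <stdlib.h>
  #include <numa.h>

  int main(void)
  {
      if (numa_available() == -1) {
          fprintf(stderr, "no NUMA support on this machine\n");
          return EXIT_FAILURE;
      }

      /* Run all threads of this process on the CPUs of node 0... */
      if (numa_run_on_node(0) == -1) {
          perror("numa_run_on_node");
          return EXIT_FAILURE;
      }

      /* ...and satisfy future memory allocations from node 0's RAM. */
      numa_set_preferred(0);

      /* Real work goes here; pages it touches are now node-local. */
      printf("pinned to node 0 of %d\n", numa_max_node() + 1);
      return EXIT_SUCCESS;
  }

The same effect can be had without modifying the program at all, e.g.

  numactl --cpunodebind=0 --membind=0 ./myprog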
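
And a sketch of the point that even straight SMP isn't uniform: two
threads incrementing adjacent counters make the coherence protocol
bounce one cache line between cores on every write.  Padding the struct
so each counter sits on its own line (64 bytes is the usual x86 line
size) typically makes the identical loops run several times faster;
exact numbers are machine-dependent.

  /* false_share.c - cache-consistency cost on plain SMP.
   * Build with: cc -O2 -pthread false_share.c */
  #include <pthread.h>
  #include <stdio.h>

  #define ITERS 100000000UL

  static struct {
      volatile unsigned long a;
      /* char pad[64];   <- uncomment to give b its own cache line */
      volatile unsigned long b;
  } counters;

  static void *bump_a(void *arg)
  {
      (void) arg;
      for (unsigned long i = 0; i < ITERS; i++) counters.a++;
      return NULL;
  }

  static void *bump_b(void *arg)
  {
      (void) arg;
      for (unsigned long i = 0; i < ITERS; i++) counters.b++;
      return NULL;
  }

  int main(void)
  {
      pthread_t t1, t2;
      pthread_create(&t1, NULL, bump_a, NULL);
      pthread_create(&t2, NULL, bump_b, NULL);
      pthread_join(t1, NULL);
      pthread_join(t2, NULL);
      printf("a = %lu, b = %lu\n", counters.a, counters.b);
      return 0;
  }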