From: Kuba Ober
To: caml-list@yquem.inria.fr
Subject: Re: [Caml-list] thousands of CPU cores
Date: Tue, 15 Jul 2008 10:39:51 -0400
In-Reply-To: <200807101500.03079.jon@ffconsultancy.com>

> > It is a stop-gap solution...
>
> That is not true. Many-core machines will always be decomposed into
> shared-memory clusters of as many cores as possible because shared memory
> parallelism will always be orders of magnitude more efficient than
> distributed parallelism.

The way "shared memory" on today's systems is implemented in hardware is
already, essentially, message passing. It's just that hardwired logic does it
all and provides the impression of shared memory, rather than having software
deal with it. The fact that the software sees it as shared memory doesn't
change the fact that, at current system bandwidths, we've already run into
physical implementation limits that make smooth, fully-random-access memory a
mere illusion.

When you read a single uncached byte out of RAM, a good deal of housekeeping
and what amounts to transactional processing happens at the hardware level. If
you count the "efficiency" of such an out-of-the-blue, truly random uncached
access in clock cycles, current hardware may be one to two orders of magnitude
less efficient than almost any 8-bit microcontroller out there. On most MCUs
you can read a random byte out of SRAM in, say, 1-4 clock cycles. On a
commonplace modern multicore CPU the same read may take a hundred clock
cycles, and essentially the same amount of wall-clock time (a 2 GHz CPU's
clock is only 100 times faster than that of a run-of-the-mill 20 MHz MCU).

What I'm trying to say is that such small, random memory accesses highlight
the inherent message-passing / transactional overhead of the hardware
implementation. Those overheads amortize, of course, when you run real
number-crunching tasks rather than a contrived cold single-byte access. But
they are there.
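To make that concrete, here is a minimal sketch (mine, not something measured
in this thread) that sums a large array in sequential and then in random
order; the random walk defeats the caches and the prefetcher, so each access
pays something close to the raw DRAM latency described above. The array size
is arbitrary, just pick it well beyond any cache:

  (* run with: ocaml bench.ml -- uses only the standard library *)
  let () =
    let n = 1 lsl 22 in                     (* 4M ints, larger than any cache *)
    let a = Array.init n (fun i -> i) in
    let seq = Array.init n (fun i -> i) in             (* sequential order *)
    let rnd = Array.init n (fun _ -> Random.int n) in  (* random order *)
    let time order =                        (* time one full walk over a *)
      let t0 = Sys.time () in
      let s = ref 0 in
      for i = 0 to n - 1 do s := !s + a.(order.(i)) done;
      ignore !s;
      Sys.time () -. t0
    in
    Printf.printf "sequential: %.3fs  random: %.3fs\n" (time seq) (time rnd)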
It's akin to an mmapped file: you can use the CPU's MMU to implement it in the
usual OS / stock-hardware framework, or you can have an FPGA handle the memory
transactions and talk directly to the hard drive. It doesn't change the fact
that it's still an mmapped file :)
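On the stock-hardware side the analogy looks like this in OCaml -- a minimal
sketch, with "data.bin" as a placeholder filename, using Unix.map_file (OCaml
4.06+; the 2008-era equivalent was Bigarray.Array1.map_file):

  (* build: ocamlfind ocamlopt -package unix -linkpkg mmap.ml *)
  let () =
    let fd = Unix.openfile "data.bin" [Unix.O_RDONLY] 0 in
    (* [| -1 |] lets the dimension be inferred from the file size *)
    let g = Unix.map_file fd Bigarray.char Bigarray.c_layout false [| -1 |] in
    let a = Bigarray.array1_of_genarray g in  (* view as a 1-D char array *)
    (* an ordinary-looking read; a page fault fetches the page if needed *)
    if Bigarray.Array1.dim a > 0 then
      Printf.printf "first byte: %C\n" (Bigarray.Array1.get a 0);
    Unix.close fd

Cheers, Kuba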