From: Kuba Ober
To: caml-list@yquem.inria.fr
Subject: Re: [Caml-list] thousands of CPU cores
Date: Tue, 15 Jul 2008 10:39:51 -0400
In-Reply-To: <200807101500.03079.jon@ffconsultancy.com>

> > It is a stop-gap solution...
>
> That is not true. Many-core machines will always be decomposed into
> shared-memory clusters of as many cores as possible because shared memory
> parallelism will always be orders of magnitude more efficient than
> distributed parallelism.

The way "shared memory" on today's systems is implemented in hardware is
already, essentially, message passing. It's just that hardwired logic does it
all and provides the impression of shared memory, rather than having software
deal with it. The fact that the software sees it as shared memory doesn't
change the fact that, at current system bandwidths, we've already run into
physical implementation limits that make smooth, fully-random-access memory a
mere illusion.

When you read a single uncached byte out of RAM, a good deal of housekeeping
and what amounts to transactional processing happens at the hardware level. If
you count the "efficiency" of such an out-of-the-blue, truly random uncached
access in clock cycles, current hardware may be one to two orders of magnitude
less efficient than almost any 8-bit microcontroller out there. On most MCUs
you can read a random byte out of SRAM in, say, 1-4 clock cycles. On a
commonplace modern multicore CPU the same read may take a hundred clock
cycles, and essentially the same amount of wall-clock time (a 2 GHz CPU's
clock is only 100 times faster than that of a run-of-the-mill 20 MHz MCU).

What I'm trying to say is that such small, random memory accesses highlight
the inherent message-passing / transactional overhead of the hardware
implementation. Those overheads amortize, of course, when you run real
number-crunching tasks rather than a contrived cold single-byte access. But
they are there.
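To make that concrete, here is a minimal sketch (mine, not something measured
in this thread) that sums a large array in sequential and then in random
order; the random walk defeats the caches and the prefetcher, so each access
pays something close to the raw DRAM latency described above. The array size
is arbitrary, just pick it well beyond any cache:

  (* run with: ocaml bench.ml -- uses only the standard library *)
  let () =
    let n = 1 lsl 22 in                     (* 4M ints, larger than any cache *)
    let a = Array.init n (fun i -> i) in
    let seq = Array.init n (fun i -> i) in             (* sequential order *)
    let rnd = Array.init n (fun _ -> Random.int n) in  (* random order *)
    let time order =                        (* time one full walk over a *)
      let t0 = Sys.time () in
      let s = ref 0 in
      for i = 0 to n - 1 do s := !s + a.(order.(i)) done;
      ignore !s;
      Sys.time () -. t0
    in
    Printf.printf "sequential: %.3fs  random: %.3fs\n" (time seq) (time rnd)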
It's akin to an mmapped file: you can use the CPU's MMU to implement it in the
usual OS / stock-hardware framework, or you can have an FPGA handle the memory
transactions and talk directly to the hard drive. It doesn't change the fact
that it's still an mmapped file :)
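On the stock-hardware side the analogy looks like this in OCaml -- a minimal
sketch, with "data.bin" as a placeholder filename, using Unix.map_file (OCaml
4.06+; the 2008-era equivalent was Bigarray.Array1.map_file):

  (* build: ocamlfind ocamlopt -package unix -linkpkg mmap.ml *)
  let () =
    let fd = Unix.openfile "data.bin" [Unix.O_RDONLY] 0 in
    (* [| -1 |] lets the dimension be inferred from the file size *)
    let g = Unix.map_file fd Bigarray.char Bigarray.c_layout false [| -1 |] in
    let a = Bigarray.array1_of_genarray g in  (* view as a 1-D char array *)
    (* an ordinary-looking read; a page fault fetches the page if needed *)
    if Bigarray.Array1.dim a > 0 then
      Printf.printf "first byte: %C\n" (Bigarray.Array1.get a 0);
    Unix.close fd

Cheers, Kuba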