From: Philippe Anel
Date: Mon, 3 Mar 2008 10:12:29 +0100
To: Fans of the OS Plan 9 from Bell Labs <9fans@cse.psu.edu>
Subject: Re: [9fans] GCC/G++: some stress testing

Ron,

I thought Paul was talking about a cache-coherent system, on which a
high-contention lock can become a huge problem. Although the work done by
Jim Taft on the NASA project looks very interesting (and if you have
pointers to papers about locking primitives on such a system, I would
appreciate them), it seems that system is memory coherent rather than
cache coherent (coherency is maintained by the SGI NUMALink interconnect
fabric).

And I agree with you. I also think (global) shared memory for IPC is more
efficient than passing copied data across the nodes, and I suppose several
papers tend to confirm this is the case: today's interconnect fabrics are
a lot faster than memory-to-memory copying.

My conjecture (I only have access to a simple dual-core machine) concerns
the locking primitives used for CSP (and IPC): I mean libthread, which is
built on the rendezvous system call, and that call does use locking
primitives (see 9/proc.c:sysrendezvous()). I think this is the only reason
why CSP would not scale well.

Regarding my (other) conjecture about IPIs, please read my answer to Paul.

	Phil;

>> If the CSP system itself takes care of the memory hierarchy and uses no
>> synchronisation (using an IPI to send a message to another core, for
>> example), CSP scales very well.
>
> Is this something you have measured, or is this conjecture?
>
>> Of course the IPI mechanism requires a switch to kernel mode, which
>> costs a lot. But this is necessary only if the destination thread is
>> running on another core, and I don't think latency is very important in
>> algorithms requiring a lot of CPUs.
>
> Same question.
>
> For a look at an interesting library that scaled well on a 1024-node SMP
> at NASA Ames, see the work by Jim Taft.
> Short form: use shared memory for IPC, not data sharing.
>
> He's done very well this way.
>
> ron
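
P.S. To make the conjecture concrete, here is a minimal sketch of the kind
of rendezvous I have in mind: a single lock serializing every exchange.
It is only an illustration under that assumption, not the real
sysrendezvous() from 9/proc.c, and every name in it is made up.

/* Sketch only: a rendezvous table guarded by one global lock.
   Hypothetical code, not the actual Plan 9 implementation. */
#include <stdatomic.h>
#include <stdint.h>
#include <stddef.h>

typedef struct Waiter Waiter;
struct Waiter {
	uintptr_t   tag;    /* rendezvous tag */
	uintptr_t   value;  /* value handed to the partner */
	_Atomic int ready;  /* set once a partner has arrived */
	Waiter     *next;
};

static atomic_flag rendezlock = ATOMIC_FLAG_INIT; /* the single lock */
static Waiter *waiting;                           /* threads parked on a tag */

static void
lockrendez(void)
{
	/* every core doing channel I/O spins on this one cache line;
	   that line bouncing between caches is the contention problem */
	while (atomic_flag_test_and_set_explicit(&rendezlock, memory_order_acquire))
		;
}

static void
unlockrendez(void)
{
	atomic_flag_clear_explicit(&rendezlock, memory_order_release);
}

/* Exchange 'value' with whichever thread shows up with the same 'tag'.
   Sleeping and waking are elided: a real kernel would block the caller
   instead of busy-waiting on 'ready'. */
uintptr_t
rendezvous_sketch(Waiter *self, uintptr_t tag, uintptr_t value)
{
	lockrendez();
	for (Waiter **wp = &waiting; *wp != NULL; wp = &(*wp)->next) {
		Waiter *w = *wp;
		if (w->tag == tag) {              /* a partner is already parked */
			*wp = w->next;            /* unlink it */
			uintptr_t v = w->value;   /* take its value ... */
			w->value = value;         /* ... and leave ours behind */
			atomic_store_explicit(&w->ready, 1, memory_order_release);
			unlockrendez();
			return v;
		}
	}
	/* nobody here yet: park ourselves under the lock, then wait outside it */
	self->tag = tag;
	self->value = value;
	atomic_store_explicit(&self->ready, 0, memory_order_relaxed);
	self->next = waiting;
	waiting = self;
	unlockrendez();
	while (!atomic_load_explicit(&self->ready, memory_order_acquire))
		;                                 /* stand-in for sleep()/wakeup() */
	return self->value;
}

Whatever the real code looks like, the shape is the same: every
send/receive pair takes the same lock and touches the same list head, so
adding cores mostly adds coherence traffic on those few cache lines, and
that, not the copying of the data itself, is what I suspect limits
libthread-style CSP on a cache-coherent machine.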