From: Paul Lalonde
Subject: Re: [9fans] GCC/G++: some stress testing
Date: Mon, 3 Mar 2008 18:31:46 -0800
To: Fans of the OS Plan 9 from Bell Labs <9fans@cse.psu.edu>

On Mar 3, 2008, at 1:12 AM, Philippe Anel wrote:

> So, does this mean the latency is only required by the I/O system
> of your program?  If so (though maybe I'm wrong), what you need is
> to be able to interrupt working cores, and I'm afraid libthread
> doesn't help here.
> If not, and your algorithm requires (a lot of) fast IPC, maybe that
> is the reason why it doesn't scale well?

No, the whole simulation has to run in the low-latency space: it's a
video game and its rendering engine, which together form a highly
heterogeneous workload.  That heterogeneity means there are many
points of contact between the various subsystems, and the (semi-)
real-time constraint means you can't just scale the problem up to
amortize the overhead costs.

>> I don't know what you mean by "CSP system itself takes care about
>> memory hierarchy".  Do you mean that the CSP implementation does
>> something about it, or do you mean that the code using the CSP
>> approach takes care of it?

> Both :)
> I agree with you that programming for the memory hierarchy is far
> more important than optimizing for CPU clocks.  But I also think
> the synchronization primitives used in CSP systems are the main
> reason CSP programs do not scale well (badly designed algorithms
> excepted, of course).
> I meant that a different CSP implementation, based on a different
> synchronisation primitive (IPI), can help here.

I'm more interested just now in working with lock-free algorithms;
I've not made any good measurements of how badly our kernels would
hit the channels as the number of threads increases.  Perhaps some of
that cost could be mitigated by a better channel implementation.

>> IPI isn't free either - apart from the OS switch, it generates bus
>> traffic that competes with the cache-coherence protocol and with
>> memory traffic; in a well-designed compute kernel that saturates
>> both compute and bandwidth, the latency hiccups so introduced can
>> propagate really badly.

> This is very interesting.  For sure IPI is not free.  But I thought
> the bus traffic generated by an IPI was less significant than that
> of cache-coherence protocols such as MESI, mainly because it is a
> one-way message.

It depends immensely on the hardware implementation of your IPI.  If
you wind up having to pay for MESI as well, then the advantage
becomes smaller.

> I think IPIs are now sent over the system bus (the local APIC used
> to talk over a separate bus), so I agree with you that they can
> saturate the bandwidth.  But I wonder whether locking primitives
> are not worse.  It would be interesting to test this.

Agreed!

Paul