From: Paul Lalonde
Subject: Re: [9fans] GCC/G++: some stress testing
Date: Mon, 3 Mar 2008 18:31:46 -0800
To: Fans of the OS Plan 9 from Bell Labs <9fans@cse.psu.edu>

On Mar 3, 2008, at 1:12 AM, Philippe Anel wrote:

> So, does this mean the latency is only required by the I/O system
> of your program?  If so (though maybe I'm wrong), what you need is
> to be able to interrupt working cores, and I'm afraid libthread
> doesn't help here.
> If not, and your algorithm requires (a lot of) fast IPC, maybe that
> is the reason why it doesn't scale well?

No, the whole simulation has to run in the low-latency space: it's a
video game and its rendering engine, which together form a highly
heterogeneous workload.  That heterogeneity means there are many
points of contact between the various subsystems, and the (semi-)
real-time constraint means you can't just scale the problem up to
amortize the overhead costs.

>> I don't know what you mean by "CSP system itself takes care about
>> memory hierarchy".  Do you mean that the CSP implementation does
>> something about it, or do you mean that the code using the CSP
>> approach takes care of it?

> Both :)
> I agree with you that programming for the memory hierarchy is far
> more important than optimizing for CPU clocks.  But I also think
> the synchronization primitives used in CSP systems are the main
> reason CSP programs do not scale well (badly designed algorithms
> excepted, of course).
> I meant that a different CSP implementation, based on a different
> synchronisation primitive (IPI), can help here.

I'm more interested just now in working with lock-free algorithms;
I've not made any good measurements of how badly our kernels would
hit the channels as the number of threads increases.  Perhaps some of
that cost could be mitigated by a better channel implementation.

>> IPI isn't free either - apart from the OS switch, it generates bus
>> traffic that competes with the cache-coherence protocol and with
>> memory traffic; in a well-designed compute kernel that saturates
>> both compute and bandwidth, the latency hiccups so introduced can
>> propagate really badly.

> This is very interesting.  For sure IPI is not free.  But I thought
> the bus traffic generated by an IPI was less significant than that
> of cache-coherence protocols such as MESI, mainly because it is a
> one-way message.

It depends immensely on the hardware implementation of your IPI.  If
you wind up having to pay for MESI as well, then the advantage
becomes smaller.

> I think IPIs are now sent over the system bus (the local APIC used
> to talk over a separate bus), so I agree with you that they can
> saturate the bandwidth.  But I wonder whether locking primitives
> are not worse.  It would be interesting to test this.

Agreed!

Paul