On Thu, 9 Mar 2006, skaller wrote:
> Ahem. Now try that on an AMDx2 (dual core). The cost goes through
> the roof if one process has a thread on each core. Because each
> core has its own cache and both caches have to be flushed/
> synchronised. And those caches are BIG!

Love to. Wanna buy me the box? :-}

Seriously: my code is attached; if someone wants to run it on other
boxes and post the results, feel free. It's GNU C/x86 specific, as I'm
using GNU C's inline assembler and the rdtsc instruction to get
accurate cycle counts. (A sketch of the timing loop is at the end of
this mail, for anyone the attachment doesn't reach.)

As to the cache comment: the whole caches don't have to be flushed,
just the line the mutex is on. That makes it roughly the cost of a
cache miss, which is also a good approximation of the cost of getting
an uncontended lock.

> I have no idea if Linux, for example, running SMP kernel,
> is smart enough to know if a mutex is shared between two
> processing units or not: AFAIK Linux doesn't support
> interprocess mutex. Windows does. Be interesting to
> compare.

It doesn't look like the mutex code is even entering the kernel. I
don't think the Linux kernel even knows the mutex *exists*, let alone
which threads are competing for it. On x86, at least, lock
instructions are not privileged, so the whole fast path can stay in
user space (see the second sketch below).

> As mentioned before the only data I have at the moment
> is a two thread counter increment experiment on a dual
> CPU G5 box, where the speed up from 2 CPUs vs 1 was
> a factor of 15 .. times SLOWER.

If you're ping-ponging a cache line between two CPUs (and the AMD
dual cores count as two CPUs), then I can easily believe that. So?

Brian
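P.S. For anyone the attachment doesn't reach, here is a minimal
sketch of the timing loop: the same idea, not the attached code
itself, and the iteration count is arbitrary.

#include <stdio.h>
#include <pthread.h>

/* Read the time-stamp counter via rdtsc (GNU C, x86).
 * The low 32 bits land in eax, the high 32 in edx. */
static inline unsigned long long rdtsc(void)
{
    unsigned int lo, hi;
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
    return ((unsigned long long) hi << 32) | lo;
}

int main(void)
{
    pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    unsigned long long start, end;
    int i, iters = 1000000;

    start = rdtsc();
    for (i = 0; i < iters; i++) {
        pthread_mutex_lock(&m);
        pthread_mutex_unlock(&m);
    }
    end = rdtsc();

    printf("%llu cycles per uncontended lock/unlock pair\n",
           (end - start) / iters);
    return 0;
}

Build with gcc -O2 and -lpthread; the printed figure is the average
cost in cycles of one lock/unlock round trip with no contention.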
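To make the "not privileged" point concrete: you can build a working
lock entirely in user space out of the x86 xchg instruction. This is
an illustration, not how any particular pthreads implementation does
it; a real mutex would sleep rather than spin when contended.

/* Minimal test-and-set spinlock (GNU C, x86). *lock is 0 when
 * free, 1 when held. xchg with a memory operand is implicitly
 * locked on x86, and it is not a privileged instruction. */
static inline int test_and_set(volatile int *lock)
{
    int old = 1;
    __asm__ __volatile__ ("xchgl %0, %1"
                          : "+r" (old), "+m" (*lock)
                          :
                          : "memory");
    return old;             /* 0 means we acquired the lock */
}

static void spin_lock(volatile int *lock)
{
    while (test_and_set(lock))
        ;                   /* spin; a real mutex would block here */
}

static void spin_unlock(volatile int *lock)
{
    __asm__ __volatile__ ("" ::: "memory");  /* compiler barrier */
    *lock = 0;              /* plain store releases the lock on x86 */
}

No system call anywhere on that path, which is why the kernel never
hears about the mutex.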
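And here is the shape of the two-thread counter experiment skaller
describes; names, padding, and counts are illustrative. As written,
each thread gets its own counter on its own cache line. Change the
second thread's argument to 0 and both threads hammer the same line,
and the wall-clock time blows up.

#include <stdio.h>
#include <pthread.h>

#define ITERS 10000000L

/* One counter per 64-byte cache line, so the two counters never
 * share a line unless we make them. */
struct counter {
    volatile long n;
} __attribute__ ((aligned (64)));

static struct counter c[2];

static void *worker(void *arg)
{
    volatile long *n = &c[(long) arg].n;
    long i;
    for (i = 0; i < ITERS; i++)
        (*n)++;             /* not atomic; we only care about timing */
    return 0;
}

int main(void)
{
    pthread_t t0, t1;
    pthread_create(&t0, 0, worker, (void *) 0L);
    pthread_create(&t1, 0, worker, (void *) 1L); /* 0L => ping-pong */
    pthread_join(t0, 0);
    pthread_join(t1, 0);
    printf("%ld %ld\n", c[0].n, c[1].n);
    return 0;
}

Run it under "time" both ways. The shared-line version also loses
updates (the increment isn't atomic), but the point is the timing:
it is dramatically slower, which is the kind of slowdown skaller is
reporting.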