* [9fans] A little ado about taslock
@ 2010-06-21 7:25 Venkatesh Srinivas
2010-06-21 14:21 ` erik quanstrom
0 siblings, 1 reply; 4+ messages in thread
From: Venkatesh Srinivas @ 2010-06-21 7:25 UTC (permalink / raw)
To: Fans of the OS Plan 9 from Bell Labs
Hi,
Erik's thread about a 16-processor x86 machine convinced me to try something
related to spinlocks.
The current 9 spinlocks are portable code, calling an arch-provided tas() in
a loop to do their thing. On i386, Intel recommends 'PAUSE' in the core of a
spin-lock loop; I modified tas to PAUSE (0xF3 0x90 if you prefer) if the
lock-acquire attempt failed.
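The idea can be sketched in portable C. This is not the actual pc/l.s assembly, just an illustration using GCC/Clang atomic builtins; the Lock/tas/lock/unlock names mirror the kernel's but the bodies here are assumptions:

```c
/* sketch of a tas-based spinlock with PAUSE on a failed acquire;
 * gcc builtins stand in for Plan 9's l.s assembly */
typedef struct { volatile int key; } Lock;

static int
tas(volatile int *key)
{
	/* returns the previous value: nonzero means the lock was held */
	int v = __atomic_exchange_n(key, 1, __ATOMIC_ACQUIRE);
	if(v != 0){
#if defined(__i386__) || defined(__x86_64__)
		__builtin_ia32_pause();	/* 0xF3 0x90; REP NOP on older cpus */
#endif
	}
	return v;
}

static void
lock(Lock *l)
{
	while(tas(&l->key) != 0)
		;
}

static void
unlock(Lock *l)
{
	__atomic_store_n(&l->key, 0, __ATOMIC_RELEASE);
}
```

PAUSE is a no-op encoding (REP NOP) on pre-P4 processors, so it is safe to emit unconditionally on i386.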
In a crude test on a 1.5GHz p4 willamette with a local fossil/venti and
256mb of ram, 'time mk 'CONF=pcf' > /dev/null' in /sys/src/9/pc, on a
fully-built source tree, adding the PAUSE reduced times from an average of
18.97s to 18.84s (across ten runs).
I tinkered a bit further. Removing the increments of glare, inglare and
lockstat.locks, coupled with the PAUSE addition, reduced the average real
time to 18.16s, again across 10 runs.
If taslock.c were arch-specific, we could almost certainly do better - i386
doesn't need the coherence() call in unlock, and we could safely test-and-tas
rather than raw tas().
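A test-and-tas (test-and-test-and-set) loop might look like the following sketch; again the names mirror the kernel's but the bodies are illustrative, using GCC builtins rather than assembly:

```c
/* test-and-test-and-set: spin on a plain read until the lock looks
 * free, and only then attempt the locked exchange */
typedef struct { volatile int key; } Lock;

static void
lock(Lock *l)
{
	for(;;){
		/* read-only spin keeps the cache line in shared state
		 * instead of hammering the bus with locked exchanges */
		while(__atomic_load_n(&l->key, __ATOMIC_RELAXED) != 0)
			;
		if(__atomic_exchange_n(&l->key, 1, __ATOMIC_ACQUIRE) == 0)
			return;
	}
}

static void
unlock(Lock *l)
{
	/* under x86's store ordering a release store is a plain store,
	 * which is why unlock needs no coherence() there */
	__atomic_store_n(&l->key, 0, __ATOMIC_RELEASE);
}
```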
There are also other places to look at wrt application of
arch-specific bits; see:
http://code.google.com/p/inferno-npe/source/detail?r=b83540e1e77e62a19cbd21d2eb54d43d338716a5
for what XADD can do for incref/decref. Similarly, pc/l.s:_xdec could be
much shorter, again using XADD.
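The XADD version of incref/decref amounts to a single locked fetch-and-add. A sketch, assuming (as in the kernel) that both return the new count; the builtin stands in for the XADD instruction:

```c
/* XADD returns the old value, so add the delta back to get the new count */
typedef struct { long ref; } Ref;

static long
incref(Ref *r)
{
	return __atomic_fetch_add(&r->ref, 1, __ATOMIC_SEQ_CST) + 1;
}

static long
decref(Ref *r)
{
	return __atomic_fetch_add(&r->ref, -1, __ATOMIC_SEQ_CST) - 1;
}
```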
None of these are a huge deal; just thought they might be interesting.
Take care,
-- vs
* Re: [9fans] A little ado about taslock
2010-06-21 7:25 [9fans] A little ado about taslock Venkatesh Srinivas
@ 2010-06-21 14:21 ` erik quanstrom
2010-06-21 16:28 ` Lyndon Nerenberg
0 siblings, 1 reply; 4+ messages in thread
From: erik quanstrom @ 2010-06-21 14:21 UTC (permalink / raw)
To: 9fans
> In a crude test on a 1.5GHz p4 willamette with a local fossil/venti and
> 256mb of ram, 'time mk 'CONF=pcf' > /dev/null' in /sys/src/9/pc, on a
> fully-built source tree, adding the PAUSE reduced times from an average of
> 18.97s to 18.84s (across ten runs).
we tried this at coraid years ago. it's a win — but only on the p4 and
netburst-based xeons with old-and-crappy hyperthreading enabled. it
seems to otherwise be a small loss.
i don't see an actual performance problem on the 16-cpu machine.
i see an apparent performance problem. the 4- and 16- processor
machines have a single-threaded speed ratio of ~ 1:1.7, so since
kprof does sampling on the clock interrupt, it seems reasonable
that processors could get in a timing-predictable loop and get
sampled at different places each time. no way rebalance is using
40% of the cpu, right? the anomaly in time(1) is not yet explained.
but it's clearly not much of a performance problem; there was only
a 10% slowdown between 1 core busy and 16 cores busy. that's
likely due to the fact that plan 9 knows nothing of the numa nature
of that board.
richard miller does point out a real problem. idlehands just returns
if conf.nproc>1. this is done so we don't have to wait for the next
clock tick should work become available. this is a power management
problem, not a performance problem. your interesting locking solution
posted previously doesn't help with this. it's not even a locking problem.
a potential solution to this would be to have a new bit array, e.g.
active.schedwait which is set when a proc has no work. the mach
could then call halt. a mach could then check for an idle mach
to wake after readying a proc. an apic ipi would be a suitable wakeup
mechanism with r.t. latencies < 500ns. (www.barrelfish.org/barrelfish_mmcs08.pdf)
one assumes that 500ns/2 + wakeup time ≈ wakeup time.
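the bookkeeping half of that could be a per-mach bit array; a sketch
with hypothetical names (schedwait, setidle, pickidle are all made up
here, and the halt/ipi calls are elided):

```c
#include <stdint.h>

/* hypothetical active.schedwait bit array: bit m is set while
 * mach m is halted with nothing to run (32 machs max here) */
static uint32_t schedwait;

static void
setidle(int machno)
{
	__atomic_fetch_or(&schedwait, (uint32_t)1 << machno, __ATOMIC_SEQ_CST);
	/* the real kernel would halt() here and clear the bit on wakeup */
}

static void
clearidle(int machno)
{
	__atomic_fetch_and(&schedwait, ~((uint32_t)1 << machno), __ATOMIC_SEQ_CST);
}

/* after readying a proc, pick one idle mach to send the wakeup ipi to */
static int
pickidle(void)
{
	uint32_t w = __atomic_load_n(&schedwait, __ATOMIC_SEQ_CST);

	if(w == 0)
		return -1;		/* nobody is halted */
	return __builtin_ctz(w);	/* lowest-numbered idle mach */
}
```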
two unfinished thoughts:
1. it sure wouldn't surprise me if this has been done in plan 9 before.
i'd be interested to know what ken's sequent kernel did.
2. if today 16 machs are possible (and 128 on an intel xeon mp 7500—
8 sockets * 8 core * 2t = 128), what do we expect in 5 years? 128?
- erik
Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
2010-06-21 7:25 [9fans] A little ado about taslock Venkatesh Srinivas
2010-06-21 14:21 ` erik quanstrom
2010-06-21 16:28 ` Lyndon Nerenberg
2010-06-21 16:38 ` David Leimbach