From mboxrd@z Thu Jan 1 00:00:00 1970
MIME-Version: 1.0
From: Venkatesh Srinivas
Date: Mon, 21 Jun 2010 03:25:32 -0400
Message-ID:
To: Fans of the OS Plan 9 from Bell Labs <9fans@9fans.net>
Content-Type: multipart/alternative; boundary=0016e6d7e6b7f02eae048985362f
Subject: [9fans] A little ado about taslock
Topicbox-Message-UUID: 35e9d22a-ead6-11e9-9d60-3106f5b1d025

--0016e6d7e6b7f02eae048985362f
Content-Type: text/plain; charset=UTF-8

Hi,

Erik's thread about a 16-processor x86 machine convinced me to try
something related to spinlocks.

The current Plan 9 spinlocks are portable code, calling an arch-provided
tas() in a loop to do their thing. On i386, Intel recommends 'PAUSE' in
the core of a spin-lock loop; I modified tas to PAUSE (0xF3 0x90 if you
prefer) if the lock-acquire attempt failed.

In a crude test on a 1.5GHz p4 willamette with a local fossil/venti and
256MB of ram, 'time mk 'CONF=pcf' > /dev/null' in /sys/src/9/pc, on a
fully-built source tree, adding the PAUSE reduced times from an average
of 18.97s to 18.84s (across ten runs).

I tinkered a bit further. Removing the increments of glare, inglare and
lockstat.locks, coupled with the PAUSE addition, reduced the average
real time to 18.16s, again across ten runs.

If taslock.c were arch-specific, we could almost certainly do better:
i386 doesn't need the coherence() call in unlock, and we could safely
test-and-tas rather than use a raw tas().

There are other places to look at too, wrt application of arch-specific
bits; see:
http://code.google.com/p/inferno-npe/source/detail?r=b83540e1e77e62a19cbd21d2eb54d43d338716a5
for what XADD can do for incref/decref. Similarly, pc/l.s:_xdec could be
much shorter, again using XADD.

None of these are a huge deal; just thought they might be interesting.

Take care,
-- vs
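
[Editorial illustration, not part of the original message.] The PAUSE
and test-and-tas ideas described above can be sketched roughly as
follows. This is portable C11 rather than Plan 9 kernel C, and it is not
the taslock.c code: SpinLock, cpurelax, spincanlock, spinlock and
spinunlock are hypothetical names chosen for the example, and the inline
asm assumes a GCC- or Clang-style compiler.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the kernel's Lock; not the Plan 9 struct. */
typedef struct SpinLock SpinLock;
struct SpinLock {
	atomic_int key;	/* 0 = free, 1 = held */
};

/* Spin-wait hint: PAUSE (0xF3 0x90) on x86, nothing elsewhere. */
static inline void
cpurelax(void)
{
#if defined(__i386__) || defined(__x86_64__)
	__asm__ __volatile__("pause");
#endif
}

/* Atomic test-and-set: stores 1 and returns the previous value. */
static inline int
tas(SpinLock *l)
{
	return atomic_exchange_explicit(&l->key, 1, memory_order_acquire);
}

/* Single attempt; returns 1 if the lock was taken, 0 otherwise. */
static inline int
spincanlock(SpinLock *l)
{
	return tas(l) == 0;
}

/* Test-and-tas: after a failed tas, spin on plain loads until free. */
static inline void
spinlock(SpinLock *l)
{
	while(tas(l) != 0){
		while(atomic_load_explicit(&l->key, memory_order_relaxed) != 0)
			cpurelax();
	}
}

/* Release the lock; the caller must currently hold it. */
static inline void
spinunlock(SpinLock *l)
{
	assert(atomic_load_explicit(&l->key, memory_order_relaxed) == 1);
	atomic_store_explicit(&l->key, 0, memory_order_release);
}
```

The inner loop only reads the lock word, so waiting processors spin on
their cached copy instead of issuing a locked exchange each time round.
On x86 a release store is an ordinary MOV, which is one way to read the
remark that i386 doesn't need the coherence() call in unlock.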
--0016e6d7e6b7f02eae048985362f--

From mboxrd@z Thu Jan 1 00:00:00 1970
From: erik quanstrom
Date: Mon, 21 Jun 2010 10:21:36 -0400
To: 9fans@9fans.net
Message-ID: <8668dded1f0a71f7c699f3f4ee7cf18c@kw.quanstro.net>
In-Reply-To:
References:
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 8bit
Subject: Re: [9fans] A little ado about taslock
Topicbox-Message-UUID: 35ef5740-ead6-11e9-9d60-3106f5b1d025

> In a crude test on a 1.5GHz p4 willamette with a local fossil/venti and
> 256mb of ram, 'time mk 'CONF=pcf' > /dev/null' in /sys/src/9/pc, on a
> fully-built source tree, adding the PAUSE reduced times from an average of
> 18.97s to 18.84s (across ten runs).

we tried this at coraid years ago. it's a win, but only on the p4 and
netburst-based xeons with old-and-crappy hyperthreading enabled. it
seems to otherwise be a small loss.

i don't see an actual performance problem on the 16-cpu machine. i see
an apparent performance problem. the 4- and 16-processor machines have a
single-threaded speed ratio of ~ 1:1.7, so since kprof does sampling on
the clock interrupt, it seems reasonable that processors could get in a
timing-predictable loop and get sampled at different places each time.
no way rebalance is using 40% of the cpu, right?

the anomaly in time(1) is not yet explained. but it's clearly not much
of a performance problem: there was only a 10% slowdown between 1 core
busy and 16 cores busy. that's likely due to the fact that plan 9 knows
nothing of the numa nature of that board.

richard miller does point out a real problem. idlehands just returns if
conf.nproc>1. this is done so we don't have to wait for the next clock
tick should work become available. this is a power management problem,
not a performance problem. your interesting locking solution posted
previously doesn't help with this. it's not even a locking problem.

a potential solution to this would be to have a new bit array, e.g.
active.schedwait, which is set when a proc has no work. the mach could
then call halt. a mach could then check for an idle mach to wake after
reading a proc. an apic ipi would be a suitable wakeup mechanism with
r.t. latencies < 500ns. (www.barrelfish.org/barrelfish_mmcs08.pdf) one
assumes that 500ns/2 + wakeup time ≈ wakeup time.

two unfinished thoughts:

1. it sure wouldn't surprise me if this has been done in plan 9 before.
i'd be interested to know what ken's sequent kernel did.

2. if today 16 machs are possible (and 128 on an intel xeon mp 7500:
8 sockets * 8 core * 2t = 128), what do we expect in 5 years? 128?

- erik

From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Mon, 21 Jun 2010 09:28:46 -0700
From: Lyndon Nerenberg
To: Fans of the OS Plan 9 from Bell Labs <9fans@9fans.net>
In-Reply-To: <8668dded1f0a71f7c699f3f4ee7cf18c@kw.quanstro.net>
Message-ID:
References: <8668dded1f0a71f7c699f3f4ee7cf18c@kw.quanstro.net>
User-Agent: Alpine 2.00 (BSF 1167 2008-08-23)
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Subject: Re: [9fans] A little ado about taslock
Topicbox-Message-UUID: 35f9dcce-ead6-11e9-9d60-3106f5b1d025

> 2. if today 16 machs are possible (and 128 on an intel xeon mp 7500:
> 8 sockets * 8 core * 2t = 128), what do we expect in 5 years? 128?

www.seamicro.com

From mboxrd@z Thu Jan 1 00:00:00 1970
MIME-Version: 1.0
In-Reply-To:
References: <8668dded1f0a71f7c699f3f4ee7cf18c@kw.quanstro.net>
Date: Mon, 21 Jun 2010 09:38:05 -0700
Message-ID:
From: David Leimbach
To: Fans of the OS Plan 9 from Bell Labs <9fans@9fans.net>
Content-Type: multipart/alternative; boundary=000e0cd6ac98d3060904898ceddc
Subject: Re: [9fans] A little ado about taslock
Topicbox-Message-UUID: 35ffbd38-ead6-11e9-9d60-3106f5b1d025

--000e0cd6ac98d3060904898ceddc
Content-Type: text/plain; charset=ISO-8859-1

On Mon, Jun 21, 2010 at 9:28 AM, Lyndon Nerenberg wrote:

> > 2. if today 16 machs are possible (and 128 on an intel xeon mp 7500:
> > 8 sockets * 8 core * 2t = 128), what do we expect in 5 years? 128?
>
> www.seamicro.com

There's a 100 core MIPS-like board available now too.

http://www.tilera.com/

Dave

--000e0cd6ac98d3060904898ceddc--
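
[Editorial illustration, not part of the original thread.] erik's
proposal of an active.schedwait bit array could look roughly like the
sketch below, assuming at most 32 machs and C11 atomics rather than
Plan 9 kernel primitives. markidle and claimidle are hypothetical names;
the halt instruction and the apic ipi are hardware-specific and appear
only in comments.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Assumed layout: bit n set means mach n has no work and is halting. */
static _Atomic uint32_t schedwait;

/* Called by an idle mach just before it would execute halt. */
static void
markidle(int machno)
{
	atomic_fetch_or(&schedwait, (uint32_t)1 << machno);
	/* a real kernel would halt here and wait for the wakeup ipi */
}

/*
 * Called by a mach that has work to hand off. Claims the
 * lowest-numbered idle mach by clearing its bit and returns its
 * number, or -1 if no mach is idle. The caller would then send
 * the ipi to the returned mach.
 */
static int
claimidle(void)
{
	uint32_t w, bit;
	int i;

	w = atomic_load(&schedwait);
	while(w != 0){
		/* w is non-zero, so this finds a set bit below 32 */
		for(i = 0; (w & ((uint32_t)1 << i)) == 0; i++)
			;
		bit = (uint32_t)1 << i;
		if(atomic_compare_exchange_weak(&schedwait, &w, w & ~bit))
			return i;
		/* failed exchange reloaded w with the current value; retry */
	}
	return -1;
}
```

Clearing the bit with a compare-and-swap means two machs that ready work
at the same time cannot both pick the same idle mach. The sketch leaves
out the window between setting the bit and actually halting, which a
real kernel would have to close so that a wakeup is not lost.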