9fans - fans of the OS Plan 9 from Bell Labs
* [9fans] interesting timing tests
@ 2010-06-18 23:26 erik quanstrom
  2010-06-19 13:42 ` Richard Miller
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: erik quanstrom @ 2010-06-18 23:26 UTC (permalink / raw)
  To: 9fans

note the extreme system time on the 16 processor machine

a	2 * Intel(R) Xeon(R) CPU            5120  @ 1.86GHz
b	4 * Intel(R) Xeon(R) CPU           E5630  @ 2.53GHz
c	16* Intel(R) Xeon(R) CPU           E5540  @ 2.53GHz

# libsec
a; objtype=arm time mk>/dev/null
0.44u 0.63s 0.94r 	 mk
b; objtype=arm >/dev/null time mk
0.29u 0.41s 0.63r 	 mk
c; objtype=arm time mk>/dev/null
0.37u 4.38s 0.85r 	 mk

# libc
a; objtype=arm >/dev/null time mk
1.16u 3.25s 4.79r 	 mk
b; objtype=arm time mk>/dev/null
0.72u 2.10s 3.05r 	 mk
c; objtype=arm time mk>/dev/null
0.97u 18.81s 5.80r 	 mk

# kernel
a; time mk>/dev/null
5.10u 2.44s 6.32r 	 mk
b; time mk>/dev/null
2.99u 1.36s 2.88r 	 mk
c; time mk>/dev/null
4.37u 13.31s 4.82r 	 mk

- erik




* Re: [9fans] interesting timing tests
  2010-06-18 23:26 [9fans] interesting timing tests erik quanstrom
@ 2010-06-19 13:42 ` Richard Miller
  2010-06-20  1:36   ` erik quanstrom
  2010-06-21 21:11 ` Bakul Shah
  2010-06-22  3:24 ` Lawrence E. Bakst
  2 siblings, 1 reply; 18+ messages in thread
From: Richard Miller @ 2010-06-19 13:42 UTC (permalink / raw)
  To: 9fans

> note the extreme system time on the 16 processor machine

kprof(3)





* Re: [9fans] interesting timing tests
  2010-06-19 13:42 ` Richard Miller
@ 2010-06-20  1:36   ` erik quanstrom
  2010-06-20  7:44     ` Richard Miller
  0 siblings, 1 reply; 18+ messages in thread
From: erik quanstrom @ 2010-06-20  1:36 UTC (permalink / raw)
  To: 9fans

On Sat Jun 19 09:44:25 EDT 2010, 9fans@hamnavoe.com wrote:
> > note the extreme system time on the 16 processor machine
>
> kprof(3)

i'm not sure i completely trust kprof these days.  there
seems to be a lot of sampling error.  the last time i tried to
use it to get timing on esp, encryption didn't show up at all
in kprof's output.  trace(3) showed that encryption was 80%
of the total cpu use.  in any event, i was suspecting that ilock
would be a big loser as nproc goes up, and it does appear to
be.  i'm less sure that runproc is really using 62% of the cpu.

c; kprof /386/9pccpu /dev/kpdata
total: 70023	in kernel text: 65773	outside kernel text: 4250
KTZERO f0100000
ms	  %	sym
40984	 62.3	runproc
9930	 15.0	ilock
5720	  8.6	_cycles
4360	  6.6	perfticks
1784	  2.7	isaconfig
1600	  2.4	iunlock

cf. the 4 processor 5600 xeon:

b; kprof /386/9pccpu /dev/kpdata
total: 14416	in kernel text: 11434	outside kernel text: 2982
KTZERO f0100000
ms	  %	sym
4036	 35.2	rebalance
2483	 21.7	runproc
1561	 13.6	_cycles
918	  8.0	perfticks
377	  3.2	unlock
337	  2.9	microdelay
259	  2.2	idlehands

- erik




* Re: [9fans] interesting timing tests
  2010-06-20  1:36   ` erik quanstrom
@ 2010-06-20  7:44     ` Richard Miller
  2010-06-20 12:45       ` erik quanstrom
  0 siblings, 1 reply; 18+ messages in thread
From: Richard Miller @ 2010-06-20  7:44 UTC (permalink / raw)
  To: 9fans

> in any event, i was suspecting that ilock
> would be a big loser as nproc goes up, and it does appear to
> be.

Spin locks would have been high on my list of suspects.

> i'm less sure that runproc is really using 62% of the cpu

Not impossible, given this:

Proc*
runproc(void)
{
		...
		/* waste time or halt the CPU */
		idlehands();
		...

and this:

void
idlehands(void)
{
	if(conf.nmach == 1)
		halt();
}





* Re: [9fans] interesting timing tests
  2010-06-20  7:44     ` Richard Miller
@ 2010-06-20 12:45       ` erik quanstrom
  2010-06-20 16:51         ` Richard Miller
  0 siblings, 1 reply; 18+ messages in thread
From: erik quanstrom @ 2010-06-20 12:45 UTC (permalink / raw)
  To: 9fans

> Spin locks would have been high on my list of suspects.

mine, too.  the 64-bit question is: which spin locks?

>
> > i'm less sure that runproc is really using 62% of the cpu
>
> Not impossible, given this:
>
> Proc*
> runproc(void)
> {
> 		...
> 		/* waste time or halt the CPU */
> 		idlehands();
> 		...
>
> and this:
>
> void
> idlehands(void)
> {
> 	if(conf.nmach == 1)
> 		halt();
> }

yet for one machine conf.nmach == 4 and for the
other conf.nmach == 16; neither is calling halt.

- erik




* Re: [9fans] interesting timing tests
  2010-06-20 12:45       ` erik quanstrom
@ 2010-06-20 16:51         ` Richard Miller
  2010-06-20 21:55           ` erik quanstrom
  0 siblings, 1 reply; 18+ messages in thread
From: Richard Miller @ 2010-06-20 16:51 UTC (permalink / raw)
  To: 9fans

> yet for one machine conf.nmach == 4 and for the
> other conf.nmach == 16; neither is calling halt.

Hypothesis: with four processors there's enough work to keep all
the cpus busy.  With sixteen processors you're getting i/o bound
(where's the filesystem coming from?) so some of the cpus are
idling, and would call halt if they were allowed to.





* Re: [9fans] interesting timing tests
  2010-06-20 16:51         ` Richard Miller
@ 2010-06-20 21:55           ` erik quanstrom
  2010-06-21  1:41             ` erik quanstrom
  0 siblings, 1 reply; 18+ messages in thread
From: erik quanstrom @ 2010-06-20 21:55 UTC (permalink / raw)
  To: 9fans

[-- Attachment #1: Type: text/plain, Size: 3855 bytes --]

> > yet for one machine conf.nmach == 4 and for the
> > other conf.nmach == 16; neither is calling halt.
>
> Hypothesis: with four processors there's enough work to keep all
> the cpus busy.  With sixteen processors you're getting i/o bound
> (where's the filesystem coming from?) so some of the cpus are
> idling, and would call halt if they were allowed to.

it seems to be a tad more complicated than that.
these machines "aren't doing anything"; i'm the only
one logged in, and they run no services.

16cpus; >/dev/kpctl echo startclr; sleep 5;>/dev/kpctl echo stop; \
	kprof /386/9pccpu /dev/kpdata | sed 10q
total: 79782	in kernel text: 79782	outside kernel text: 0
KTZERO f0100000
ms	  %	sym
60348	 75.2	runproc
9125	 11.3	_cycles
6230	  7.7	perfticks
2696	  3.3	isaconfig
1271	  1.5	idlehands
1127	  1.4	microdelay
1	  0.0	freepte

4cpus; >/dev/kpctl echo startclr; sleep 5;>/dev/kpctl echo stop; \
	kprof /386/9pccpu /dev/kpdata | sed 10q
total: 20327	in kernel text: 20327	outside kernel text: 0
KTZERO f0100000
ms	  %	sym
8124	 40.2	rebalance
5261	 26.0	runproc
3252	 16.1	_cycles
1997	  9.9	perfticks
702	  3.4	microdelay
548	  2.7	idlehands
349	  1.7	isaconfig

this trend continues with burncycles, a program
(attached) that actually does stuff on n cpus:

4cpus; for(i in 1 2 4){
	>/dev/kpctl echo startclr;
	>/dev/null time 8.burncycles $i;
	>/dev/kpctl echo stop;
	kprof /386/9pccpu /dev/kpdata|sed 10q
}
10.56u 0.00s 10.56r 	 8.burncycles 1
total: 42246	in kernel text: 31684	outside kernel text: 10562
KTZERO f0100000
ms	  %	sym
12693	 40.0	rebalance
8324	 26.2	runproc
5215	 16.4	_cycles
3182	 10.0	perfticks
1088	  3.4	microdelay
902	  2.8	idlehands
611	  1.9	isaconfig
10.56u 0.00s 10.56r 	 8.burncycles 2
total: 42561	in kernel text: 21441	outside kernel text: 21120
KTZERO f0100000
ms	  %	sym
8567	 39.9	rebalance
5558	 25.9	runproc
3483	 16.2	_cycles
2190	 10.2	perfticks
742	  3.4	microdelay
590	  2.7	idlehands
408	  1.9	isaconfig
10.56u 0.00s 10.56r 	 8.burncycles 4
total: 42524	in kernel text: 428	outside kernel text: 42096
KTZERO f0100000
ms	  %	sym
159	 37.1	rebalance
120	 28.0	runproc
63	 14.7	_cycles
49	 11.4	perfticks
17	  3.9	idlehands
9	  2.1	isaconfig
9	  2.1	microdelay

8cpus; for(i in 1 2 4 8 16){
	>/dev/kpctl echo startclr;
	>/dev/null time 8.burncycles $i;
	>/dev/kpctl echo stop;
	kprof /386/9pccpu /dev/kpdata|sed 10q
}
17.26u 0.00s 17.26r 	 8.burncycles 1
total: 265856	in kernel text: 248594	outside kernel text: 17262
KTZERO f0100000
ms	  %	sym
191427	 77.0	runproc
28607	 11.5	_cycles
21218	  8.5	perfticks
8618	  3.4	isaconfig
4408	  1.7	idlehands
3584	  1.4	microdelay
1	  0.0	nhgets
17.64u 0.00s 17.64r 	 8.burncycles 2
total: 276561	in kernel text: 241360	outside kernel text: 35201
KTZERO f0100000
ms	  %	sym
181186	 75.0	runproc
26816	 11.1	_cycles
23267	  9.6	perfticks
8049	  3.3	isaconfig
4261	  1.7	idlehands
3483	  1.4	microdelay
2	  0.0	sleep
18.87u 0.00s 18.87r 	 8.burncycles 4
total: 297021	in kernel text: 225113	outside kernel text: 71908
KTZERO f0100000
ms	  %	sym
168136	 74.6	runproc
24904	 11.0	_cycles
22849	 10.1	perfticks
7462	  3.3	isaconfig
3879	  1.7	idlehands
3058	  1.3	microdelay
1	  0.0	ilock
18.65u 0.00s 18.65r 	 8.burncycles 8
total: 289838	in kernel text: 148804	outside kernel text: 141034
KTZERO f0100000
ms	  %	sym
117215	 78.7	runproc
16729	 11.2	_cycles
12872	  8.6	perfticks
5119	  3.4	isaconfig
2765	  1.8	idlehands
2064	  1.3	microdelay
2	  0.0	sleep
19.34u 0.00s 19.35r 	 8.burncycles 16
total: 281308	in kernel text: -9895	outside kernel text: 291203
KTZERO f0100000
ms	  %	sym
497	 -5.0	runproc
78	  0.-7	_cycles
50	  0.-5	perfticks
14	  0.-1	isaconfig
10	  0.-1	microdelay
8	  0.0	idlehands
1	  0.0	ilock

- erik

[-- Attachment #2: burncycles.c --]
[-- Type: text/plain, Size: 787 bytes --]

#include <u.h>
#include <libc.h>
#include <thread.h>

#define Scale	(100000000000ull)

/*
 * waste time
 */
vlong
πjj(uint j)
{
	vlong v;

	v = 4*Scale / (2*j + 1);
	if(j&1)
		return -v;
	return v;
}

vlong
π(void)
{
	uint i;
	vlong v;

	v = 0;
	for(i = 0; i < 500000000; i++)
		v += πjj(i);
	return v;
}

void
p(void *v)
{
	int i;

	i = (int)v;
	print("%d: %lld\n", i, π());
	threadexits("");
}

void
usage(void)
{
	fprint(2, "usage: burncycles nthread\n");
	threadexits("usage");
}

void
threadmain(int argc, char **argv)
{
	int n, i;

	ARGBEGIN{
	default:
		usage();
	}ARGEND
	n = 1;
	else if(argc > 1 || (n = atoi(argv[0])) <= 0)
		usage();
	for(i = 0; i < n-1; i++)
		proccreate(p, (void*)i, 4096);
	p((void*)i);
}


* Re: [9fans] interesting timing tests
  2010-06-20 21:55           ` erik quanstrom
@ 2010-06-21  1:41             ` erik quanstrom
  2010-06-21  3:46               ` Venkatesh Srinivas
  0 siblings, 1 reply; 18+ messages in thread
From: erik quanstrom @ 2010-06-21  1:41 UTC (permalink / raw)
  To: 9fans

[-- Attachment #1: Type: text/plain, Size: 1072 bytes --]

oops.  botched fix of a harmless warning.
corrected source attached.

just for a giggle, i ran this test on a few handy machines to get a feel
for relative speed of a single core.  since this test is small enough to
fit in the tiniest cache, i would think that memory speed or any other
external factor would be unimportant:

open rd/marvell kirkwood				471.97u 0.00s 472.25r
Intel(R) Atom(TM) CPU  330   @ 1.60GHz		48.47u 0.00s 48.48r
Intel(R) Pentium(R) 4 CPU 3.00GHz		40.72u 0.00s 40.76r
AMD Athlon(tm) 64 X2 Dual Core Processor 3800+	30.62u 0.00s 30.64r
AMD Athlon(tm) 64 X2 Dual Core Processor 5000+	23.18u 0.00s 23.19r
Intel(R) Xeon(R) CPU            5120  @ 1.86GHz	23.16u 0.00s 23.08r
Intel(R) Xeon(R) CPU           E5540  @ 2.53GHz	17.26u 0.00s 17.26r
Intel(R) Core(TM) i7 CPU         920  @ 2.67GHz	16.86u 0.00s 16.86r
Intel(R) Xeon(R) CPU           E5630  @ 2.53GHz	10.46u 0.00s 10.50r

perhaps the arm's vlong arithmetic isn't as well optimized as it is
on x86.  the atom is also conspicuously slow, but unfortunately with
no obvious excuse.

- erik

[-- Attachment #2: burncycles.c --]
[-- Type: text/plain, Size: 814 bytes --]

#include <u.h>
#include <libc.h>
#include <thread.h>

#define Scale	(100000000000ull)

/*
 * waste time
 */
vlong
πjj(uint j)
{
	vlong v;

	v = 4*Scale / (2*j + 1);
	if(j&1)
		return -v;
	return v;
}

vlong
π(void)
{
	uint i;
	vlong v;

	v = 0;
	for(i = 0; i < 500000000; i++)
		v += πjj(i);
	return v;
}

void
p(void *v)
{
	int i;

	i = (int)v;
	print("%d: %lld\n", i, π());
	threadexits("");
}

void
usage(void)
{
	fprint(2, "usage: burncycles nthread\n");
	threadexits("usage");
}

void
threadmain(int argc, char **argv)
{
	int n, i;

	ARGBEGIN{
	default:
		usage();
	}ARGEND
	n = 0;
	if(argc == 0)
		n = 1;
	else if(argc != 1 || (n = atoi(argv[0])) <= 0)
		usage();
	for(i = 0; i < n-1; i++)
		proccreate(p, (void*)i, 4096);
	p((void*)i);
}


* Re: [9fans] interesting timing tests
  2010-06-21  1:41             ` erik quanstrom
@ 2010-06-21  3:46               ` Venkatesh Srinivas
  2010-06-21 14:40                 ` erik quanstrom
  0 siblings, 1 reply; 18+ messages in thread
From: Venkatesh Srinivas @ 2010-06-21  3:46 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

[-- Attachment #1: Type: text/plain, Size: 1144 bytes --]

knowing which locks would rock.

i imagine the easiest way to find out would be to modify lock() to bump a
per-lock counter on failure-to-acquire. on i386 a LOCK ADD would be the
easiest way to do that, i think. add an 'inited' field to the spinlock and
a list linkage as well, to allow for easy examination when you hit the
system with acid.
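
a rough sketch of what i mean (contended, inited and listnext are
hypothetical additions to Lock, locklist is a hypothetical global,
and the plain increment stands in for the LOCK ADD mentioned above):

static Lock *locklist;		/* every lock ever taken, for acid to walk */

void
lock(Lock *l)
{
	if(!l->inited){			/* first use: add it to the survey list */
		l->inited = 1;
		l->listnext = locklist;	/* racy, but fine for a rough survey */
		locklist = l;
	}
	while(tas(&l->key) != 0){
		l->contended++;		/* a LOCK ADD here would make the count exact */
		while(l->key != 0)
			;		/* spin on plain reads until it looks free */
	}
}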

also, if the locks in question really do need to be taken and the resources
they protect cannot be split, we can do much better than our current spinlocks:

void lock(int *l) {
    int old = __sync_fetch_and_add(l, 1);
    short next,owner;

    do {
        next = old & 0x0000FFFF;
        owner = (old >> 16) & 0x0000FFFF;

        old = *l;
    } while(next != owner);
}


void unlock(int *l) {
    __sync_fetch_and_add(l, (1 << 16));
}

(this is in gcc-C, but porting wouldn't be bad; the unlock
__sync_fetch_and_add would be LOCK ADD on i386. the __sync_fetch_and_add in
lock would be LOCK XADD on i386. i don't know 8a's syntax well enough to do
this right, in particular how 8a's pseudoregs work).

(many credits to nick piggin for this lock design. it's totally rad.)

-- vs



* Re: [9fans] interesting timing tests
  2010-06-21  3:46               ` Venkatesh Srinivas
@ 2010-06-21 14:40                 ` erik quanstrom
  2010-06-21 16:42                   ` Venkatesh Srinivas
  2010-06-21 16:43                   ` erik quanstrom
  0 siblings, 2 replies; 18+ messages in thread
From: erik quanstrom @ 2010-06-21 14:40 UTC (permalink / raw)
  To: 9fans

void
lock(ulong *l)
{
	ulong old;
	ushort next, owner;

	old = _xadd(l, 1);
	for(;;){
		next = old;
		owner = old>>16;
		old = *l;
		if(next == owner)
			break;
	}
}

void
unlock(ulong *l)
{
	_xadd(l, 1<<16);
}

- erik




* Re: [9fans] interesting timing tests
  2010-06-21 14:40                 ` erik quanstrom
@ 2010-06-21 16:42                   ` Venkatesh Srinivas
  2010-06-21 16:43                   ` erik quanstrom
  1 sibling, 0 replies; 18+ messages in thread
From: Venkatesh Srinivas @ 2010-06-21 16:42 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

[-- Attachment #1: Type: text/plain, Size: 857 bytes --]

On Mon, Jun 21, 2010 at 10:40 AM, erik quanstrom <quanstro@quanstro.net> wrote:

> void
> lock(ulong *l)
> {
>        ulong old;
>        ushort next, owner;
>
>        old = _xadd(l, 1);
>        for(;;){
>                next = old;
>                owner = old>>16;
>                old = *l;
>                if(next == owner)
>                        break;
>        }
> }
>
> void
> unlock(ulong *l)
> {
>        _xadd(l, 1<<16);
> }


Sure, that's reasonable in C (i wasn't sure how to do it in asm for 8a;
that was what I was asking about). Just remember also to provide xadd; the
distribution 8a and 8l didn't support it last I checked.

Just another observation: we can skip the reload of old in the uncontended
case if we swap the order of the old = *l reload and the compare/break in lock.
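
Concretely, something like this (untested):

old = _xadd(l, 1);
for(;;){
	next = old;
	owner = old>>16;
	if(next == owner)	/* uncontended: the xadd result already matches */
		break;
	old = *l;		/* re-read only when we actually have to wait */
}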

Anyway, thoughts on this lock?

-- vs



* Re: [9fans] interesting timing tests
  2010-06-21 14:40                 ` erik quanstrom
  2010-06-21 16:42                   ` Venkatesh Srinivas
@ 2010-06-21 16:43                   ` erik quanstrom
  1 sibling, 0 replies; 18+ messages in thread
From: erik quanstrom @ 2010-06-21 16:43 UTC (permalink / raw)
  To: 9fans

On Mon Jun 21 10:51:30 EDT 2010, quanstro@quanstro.net wrote:
> void
> lock(ulong *l)

somehow lost was an observation: since lock only
tests that next == owner, and both are recomputed from
the current value of *l each time around the loop, i
don't see how this is robust in the face of more than
one mach spinning.  who wins?  am i missing something?
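
for comparison, the usual ticket-lock formulation draws the ticket
from the xadd exactly once and re-reads only the owner half; a
sketch, untested, assuming _xadd returns the pre-increment value:

void
lock(ulong *l)
{
	ushort ticket, owner;

	ticket = _xadd(l, 1);		/* my ticket: the low half, drawn once */
	for(;;){
		owner = *l>>16;		/* re-read only the owner half */
		if(owner == ticket)
			break;
	}
}

that way each spinner waits for its own number, and exactly one
of them matches when unlock advances the owner.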

also lost was the assembly which should be (from
memory) something like

TEXT _xadd(SB), 1, $0
	MOVL	l+0(FP), BX
	MOVL	n+4(FP), AX
	LOCK; XADD AX, 0(BX)
	RET

unfortunately that is not accepted by the assembler,
and (hopefully) equivalent BYTE statements were
rejected by the linker.  perhaps someone knows immediately
how to sneak XADD in; i haven't yet investigated.

- erik




* Re: [9fans] interesting timing tests
  2010-06-18 23:26 [9fans] interesting timing tests erik quanstrom
  2010-06-19 13:42 ` Richard Miller
@ 2010-06-21 21:11 ` Bakul Shah
  2010-06-21 21:21   ` erik quanstrom
  2010-06-22  3:24 ` Lawrence E. Bakst
  2 siblings, 1 reply; 18+ messages in thread
From: Bakul Shah @ 2010-06-21 21:11 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Fri, 18 Jun 2010 19:26:25 EDT erik quanstrom <quanstro@labs.coraid.com>  wrote:
> note the extreme system time on the 16 processor machine

Could this be due to memory contention caused by spinlocks?
While locks are spinning they eat up memory bandwidth which
slows down everyone's memory accesses (including the one who
is trying to finish its work while holding the spinlock).
And the more processors contend, the worse it gets....

How well does plan9 lock() scale with the number of processors?

Since this is analogous to accessing a CSMA network, one can
use a similar algorithm to ameliorate the bandwidth problem:
if you didn't get the lock, assume it will be a little while
before you can get it so you might as well backoff.  There is
an old paper that talks about this.
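
Roughly, the idea in C (just a sketch; Lock, tas() and microdelay()
are used here only for illustration, and the delay bounds are
arbitrary rather than tuned):

void
backofflock(Lock *l)
{
	int d;

	d = 1;
	while(tas(&l->key) != 0){
		/* lost the race: back off, roughly doubling the wait on each miss */
		do {
			microdelay(d);
			if(d < 64)
				d <<= 1;
		} while(l->key != 0);
	}
}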

Or it could simply be due to caching behaviour, if everyone
is accessing/mutating the same pages at the same time.




* Re: [9fans] interesting timing tests
  2010-06-21 21:11 ` Bakul Shah
@ 2010-06-21 21:21   ` erik quanstrom
  2010-06-21 21:47     ` Bakul Shah
  0 siblings, 1 reply; 18+ messages in thread
From: erik quanstrom @ 2010-06-21 21:21 UTC (permalink / raw)
  To: 9fans

> > note the extreme system time on the 16 processor machine
>
> Could this be due to memory contention caused by spinlocks?
> While locks are spinning they eat up memory bandwidth which
> slows down everyone's memory accesses (including the one who
> is trying to finish its work while holding the spinlock).
> And the more processors contend, the worse it gets....

perhaps.

> How well does plan9 lock() scale with the number of processors?

i think the question is, are there any spin locks that can become
unreasonably contended as conf.nmach goes up?  if so, i would
think that rather than finding the optimal solution to pessimal
use of spinlocks, we should look to optimize our use of spinlocks.

the underlying assumption is that the contended case is rare.
if this is not the case, then spin locks are not a good choice.

- erik




* Re: [9fans] interesting timing tests
  2010-06-21 21:21   ` erik quanstrom
@ 2010-06-21 21:47     ` Bakul Shah
  2010-06-21 22:16       ` erik quanstrom
  0 siblings, 1 reply; 18+ messages in thread
From: Bakul Shah @ 2010-06-21 21:47 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Mon, 21 Jun 2010 17:21:36 EDT erik quanstrom <quanstro@quanstro.net>  wrote:
> > > note the extreme system time on the 16 processor machine
> >
> > Could this be due to memory contention caused by spinlocks?
> > While locks are spinning they eat up memory bandwidth which
> > slows down everyone's memory accesses (including the one who
> > is trying to finish its work while holding the spinlock).
> > And the more processors contend, the worse it gets....
>
> perhaps.

Is there a way to check this?

Is there a way to completely shut off N processors and
measure the benchmark slowdown as a function of the number of processors?

> > How well does plan9 lock() scale with the number of processors?
>
> i think the question is, are there any spin locks that can become
> unreasonablly contended as conf.nmach goes up.  if so, i would
> think that rather than finding the optimal solution to pessimal
> use of spinlocks, we should look to optimize our use of spinlocks.

I mentioned this as something to check but I wouldn't be
surprised if the problem is a combination of factors.  So first
you have to find out if this is the problem in your case
before worrying about it.

> the underlying assumption is that the contended case is rare.
> if this is not the case, then spin locks are not a good choice.

With 8 dual HT processors the probability has gone up quite a
bit!

And what will you replace spinlocks with? The underlying
issue is contention due to sharing. If you can reduce sharing
you can reduce contention. Backoff alg. seems promising
because it can reduce memory access where it matters most.
For tens & tens of processors or more, message passing is the
only way but that would be a major redesign!




* Re: [9fans] interesting timing tests
  2010-06-21 21:47     ` Bakul Shah
@ 2010-06-21 22:16       ` erik quanstrom
  0 siblings, 0 replies; 18+ messages in thread
From: erik quanstrom @ 2010-06-21 22:16 UTC (permalink / raw)
  To: 9fans

> Is there a way to check this?
>
> Is there a way to completely shut off N processors and
> measure the benchmark slowdown as a function of the number of processors?

there hasn't been any performance impact measured.
however, the extreme system time still seems weird.

richard miller suggested that kprof might be suffering from sampling
error, and moving rebalance to the end of the file confirmed it.  both
the 4- and 16-processor machines have similar behavior.

> > the underlying assumption is that the contended case is rare.
> > if this is not the case, then spin locks are not a good choice.
>
> With 8 dual HT processors the probability has gone up quite a
> bit!

only if contention depends on processing speed.  (locks may be
interrupted if they're not ilocks.)

for example, a lock protecting packet rx for an ethernet
driver would not depend strongly on the number of processors or
processing speed.

> And what will you replace spinlocks with?

the right answer here is likely "mu".  some spinlocks
might need to be replaced with another structure.
but i would think that would depend entirely on the
situation.  in general, i am not suggesting
deprecating spinlocks.

- erik




* Re: [9fans] interesting timing tests
  2010-06-18 23:26 [9fans] interesting timing tests erik quanstrom
  2010-06-19 13:42 ` Richard Miller
  2010-06-21 21:11 ` Bakul Shah
@ 2010-06-22  3:24 ` Lawrence E. Bakst
  2010-06-23  1:09   ` erik quanstrom
  2 siblings, 1 reply; 18+ messages in thread
From: Lawrence E. Bakst @ 2010-06-22  3:24 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

Do you have a way to turn off one of the sockets on "c" (2 x E5540) and get the numbers with HT (8 processors) and without HT (4 processors)? It would also be interesting to see "c" with HT turned off.

Certainly it seems to me that idlehands needs to be fixed; your bit array "active.schedwait" is one way.

In my experience bringing up the Alliant FX/8 mini-supercomputer, which had 8 (mostly CPU) + 12 (mostly I/O) = 20 processors, there were a bunch of details that had to be addressed as we went from 1 to 20 processors. There were even some issues with the system timing (user, sys, real) itself being messed up, but I can't remember the details.

I do remember one customer with a billing system whose own customers complained that, in high-I/O environments, they were being charged for interrupts (included in sys time back then) that they didn't incur, which was true. I think we fixed that one by having an idle process per CPU and charging each interrupt to that processor's idle process.

I mention it because we got a lot of mileage out of the decision to give every processor an idle process. Our scheduler was set up to run that process only if there were no other processes available for that processor. When the idle process did run, it did a few things and then called halt. There is more to the story; if anyone is interested, let me know and I'll either post a follow-up or respond in private.

We used to have a saying at Alliant: "Data drives out speculation".

leb


At 7:26 PM -0400 6/18/10, erik quanstrom wrote:
>note the extreme system time on the 16 processor machine
>
>a	2 * Intel(R) Xeon(R) CPU            5120  @ 1.86GHz
>b	4 * Intel(R) Xeon(R) CPU           E5630  @ 2.53GHz
>c	16* Intel(R) Xeon(R) CPU           E5540  @ 2.53GHz

--
leb@iridescent.org





* Re: [9fans] interesting timing tests
  2010-06-22  3:24 ` Lawrence E. Bakst
@ 2010-06-23  1:09   ` erik quanstrom
  0 siblings, 0 replies; 18+ messages in thread
From: erik quanstrom @ 2010-06-23  1:09 UTC (permalink / raw)
  To: 9fans

> Do you have a way to turn off one of the sockets on "c" (2 x E5540) and get the numbers with HT (8 processors) and without HT (4 processors)? It would also be interesting to see "c" with HT turned off.

here's the progression, by number of processors:

ncpu	user/sys/real		%ilock
4	4.41u 1.83s 4.06r	0.
8	4.47u 2.37s 3.60r	2.0
12	4.49u 8.34s 4.40r	11.0
16	4.36u 13.16s 4.43r	14.7

here's a fun little calculation:
	16 threads * 4.43 s * 0.147 + 1.83s baseline
		= 10.41936 + 1.83 thread*s
		= 12.25s
which is close to the measured 13.16s of system time.  it seems that
increased ilock contention is a big factor in the increase in system time.

ilock accounting shows most (>80%) of the long-held ilocks
(>8.5µs, ~21k cycles) starting at /sys/src/libc/port/pool.c:1318.
this is no surprise.  technically, a long-held ilock is not
really a problem until somebody else wants it.  but we can be
fairly certain that allocb/malloc is a contended code path.
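
for illustration, the accounting looks roughly like this (a sketch,
not the instrumentation actually used; Longheld, lockstat() and the
acquired field are invented):

enum {
	Longheld	= 21000,	/* ≈8.5µs in cycles at ~2.5GHz */
};

void
ilock(Lock *l)
{
	int x;

	x = splhi();
	while(tas(&l->key) != 0)
		;			/* spin with interrupts off */
	l->sr = x;
	l->pc = getcallerpc(&l);	/* remember who took it */
	cycles(&l->acquired);		/* and when (invented field) */
}

void
iunlock(Lock *l)
{
	uvlong now;
	int x;

	cycles(&now);
	if(now - l->acquired > Longheld)
		lockstat(l->pc, now - l->acquired);	/* log the hot call site */
	x = l->sr;
	coherence();
	l->key = 0;
	splx(x);
}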

hopefully i'll be able to test a less-contended replacement for
allocb/freeb before i run out of time with this machine.

> Certainly it seems to me that idlehands needs to be fixed,
> your bit array "active.schedwait" is one way.

i'm not convinced that idlehands is anything but a power-waster;
performance-wise, it's nearly ideal.

- erik



