* [9fans] interesting timing tests
@ 2010-06-18 23:26 erik quanstrom
2010-06-19 13:42 ` Richard Miller
` (2 more replies)
0 siblings, 3 replies; 18+ messages in thread
From: erik quanstrom @ 2010-06-18 23:26 UTC (permalink / raw)
To: 9fans
note the extreme system time on the 16 processor machine
a 2 * Intel(R) Xeon(R) CPU 5120 @ 1.86GHz
b 4 * Intel(R) Xeon(R) CPU E5630 @ 2.53GHz
c 16* Intel(R) Xeon(R) CPU E5540 @ 2.53GHz
# libsec
a; objtype=arm time mk>/dev/null
0.44u 0.63s 0.94r mk
b; objtype=arm >/dev/null time mk
0.29u 0.41s 0.63r mk
c; objtype=arm time mk>/dev/null
0.37u 4.38s 0.85r mk
# libc
a; objtype=arm >/dev/null time mk
1.16u 3.25s 4.79r mk
b; objtype=arm time mk>/dev/null
0.72u 2.10s 3.05r mk
c; objtype=arm time mk>/dev/null
0.97u 18.81s 5.80r mk
# kernel
a; time mk>/dev/null
5.10u 2.44s 6.32r mk
b; time mk>/dev/null
2.99u 1.36s 2.88r mk
c; time mk>/dev/null
4.37u 13.31s 4.82r mk
- erik
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [9fans] interesting timing tests
2010-06-18 23:26 [9fans] interesting timing tests erik quanstrom
@ 2010-06-19 13:42 ` Richard Miller
2010-06-20 1:36 ` erik quanstrom
2010-06-21 21:11 ` Bakul Shah
2010-06-22 3:24 ` Lawrence E. Bakst
2 siblings, 1 reply; 18+ messages in thread
From: Richard Miller @ 2010-06-19 13:42 UTC (permalink / raw)
To: 9fans
> note the extreme system time on the 16 processor machine
kprof(3)
* Re: [9fans] interesting timing tests
2010-06-19 13:42 ` Richard Miller
@ 2010-06-20 1:36 ` erik quanstrom
2010-06-20 7:44 ` Richard Miller
0 siblings, 1 reply; 18+ messages in thread
From: erik quanstrom @ 2010-06-20 1:36 UTC (permalink / raw)
To: 9fans
On Sat Jun 19 09:44:25 EDT 2010, 9fans@hamnavoe.com wrote:
> > note the extreme system time on the 16 processor machine
>
> kprof(3)
i'm not sure i completely trust kprof these days. there
seems to be a lot of sampling error. the last time i tried to
use it to get timing on esp, encryption didn't show up at all
in kprof's output. trace(3) showed that encryption was 80%
of the total cpu use. in any event, i was suspecting that ilock
would be a big loser as nproc goes up, and it does appear to
be. i'm less sure that runproc is really using 62% of the cpu
c; kprof /386/9pccpu /dev/kpdata
total: 70023 in kernel text: 65773 outside kernel text: 4250
KTZERO f0100000
ms % sym
40984 62.3 runproc
9930 15.0 ilock
5720 8.6 _cycles
4360 6.6 perfticks
1784 2.7 isaconfig
1600 2.4 iunlock
cf. the 4 processor 5600 xeon:
b; kprof /386/9pccpu /dev/kpdata
total: 14416 in kernel text: 11434 outside kernel text: 2982
KTZERO f0100000
ms % sym
4036 35.2 rebalance
2483 21.7 runproc
1561 13.6 _cycles
918 8.0 perfticks
377 3.2 unlock
337 2.9 microdelay
259 2.2 idlehands
- erik
* Re: [9fans] interesting timing tests
2010-06-20 1:36 ` erik quanstrom
@ 2010-06-20 7:44 ` Richard Miller
2010-06-20 12:45 ` erik quanstrom
0 siblings, 1 reply; 18+ messages in thread
From: Richard Miller @ 2010-06-20 7:44 UTC (permalink / raw)
To: 9fans
> in any event, i was suspecting that ilock
> would be a big loser as nproc goes up, and it does appear to
> be.
Spin locks would have been high on my list of suspects.
> i'm less sure that runproc is really using 62% of the cpu
Not impossible, given this:
Proc*
runproc(void)
{
...
/* waste time or halt the CPU */
idlehands();
...
and this:
void
idlehands(void)
{
if(conf.nmach == 1)
halt();
}
* Re: [9fans] interesting timing tests
2010-06-20 7:44 ` Richard Miller
@ 2010-06-20 12:45 ` erik quanstrom
2010-06-20 16:51 ` Richard Miller
0 siblings, 1 reply; 18+ messages in thread
From: erik quanstrom @ 2010-06-20 12:45 UTC (permalink / raw)
To: 9fans
> Spin locks would have been high on my list of suspects.
mine, too. the 64 bit question is, which spin locks.
>
> > i'm less sure that runproc is really using 62% of the cpu
>
> Not impossible, given this:
>
> Proc*
> runproc(void)
> {
> ...
> /* waste time or halt the CPU */
> idlehands();
> ...
>
> and this:
>
> void
> idlehands(void)
> {
> if(conf.nmach == 1)
> halt();
> }
yet for one machine conf.nmach == 4 and for the
other conf.nmach == 16; neither is calling halt.
- erik
* Re: [9fans] interesting timing tests
2010-06-20 12:45 ` erik quanstrom
@ 2010-06-20 16:51 ` Richard Miller
2010-06-20 21:55 ` erik quanstrom
0 siblings, 1 reply; 18+ messages in thread
From: Richard Miller @ 2010-06-20 16:51 UTC (permalink / raw)
To: 9fans
> yet for one machine conf.nmach == 4 and for the
> other conf.nmach == 16; neither is calling halt.
Hypothesis: with four processors there's enough work to keep all
the cpus busy. With sixteen processors you're getting i/o bound
(where's the filesystem coming from?) so some of the cpus are
idling, and would call halt if they were allowed to.
* Re: [9fans] interesting timing tests
2010-06-20 16:51 ` Richard Miller
@ 2010-06-20 21:55 ` erik quanstrom
2010-06-21 1:41 ` erik quanstrom
0 siblings, 1 reply; 18+ messages in thread
From: erik quanstrom @ 2010-06-20 21:55 UTC (permalink / raw)
To: 9fans
> > yet for one machine conf.nmach == 4 and for the
> > other conf.nmach == 16; neither is calling halt.
>
> Hypothesis: with four processors there's enough work to keep all
> the cpus busy. With sixteen processors you're getting i/o bound
> (where's the filesystem coming from?) so some of the cpus are
> idling, and would call halt if they were allowed to.
it seems to be a tad more complicated than that.
these machines "aren't doing anything"; i'm the only
one logged in, and they run no services.
16cpus; >/dev/kpctl echo startclr; sleep 5;>/dev/kpctl echo stop; \
kprof /386/9pccpu /dev/kpdata | sed 10q
total: 79782 in kernel text: 79782 outside kernel text: 0
KTZERO f0100000
ms % sym
60348 75.2 runproc
9125 11.3 _cycles
6230 7.7 perfticks
2696 3.3 isaconfig
1271 1.5 idlehands
1127 1.4 microdelay
1 0.0 freepte
4cpus; >/dev/kpctl echo startclr; sleep 5;>/dev/kpctl echo stop; \
kprof /386/9pccpu /dev/kpdata | sed 10q
total: 20327 in kernel text: 20327 outside kernel text: 0
KTZERO f0100000
ms % sym
8124 40.2 rebalance
5261 26.0 runproc
3252 16.1 _cycles
1997 9.9 perfticks
702 3.4 microdelay
548 2.7 idlehands
349 1.7 isaconfig
this trend continues with burncycles, a program
(attached) that actually does stuff on n cpus:
4cpus; for(i in 1 2 4){
>/dev/kpctl echo startclr;
>/dev/null time 8.burncycles $i;
>/dev/kpctl echo stop;
kprof /386/9pccpu /dev/kpdata|sed 10q
}
10.56u 0.00s 10.56r 8.burncycles 1
total: 42246 in kernel text: 31684 outside kernel text: 10562
KTZERO f0100000
ms % sym
12693 40.0 rebalance
8324 26.2 runproc
5215 16.4 _cycles
3182 10.0 perfticks
1088 3.4 microdelay
902 2.8 idlehands
611 1.9 isaconfig
10.56u 0.00s 10.56r 8.burncycles 2
total: 42561 in kernel text: 21441 outside kernel text: 21120
KTZERO f0100000
ms % sym
8567 39.9 rebalance
5558 25.9 runproc
3483 16.2 _cycles
2190 10.2 perfticks
742 3.4 microdelay
590 2.7 idlehands
408 1.9 isaconfig
10.56u 0.00s 10.56r 8.burncycles 4
total: 42524 in kernel text: 428 outside kernel text: 42096
KTZERO f0100000
ms % sym
159 37.1 rebalance
120 28.0 runproc
63 14.7 _cycles
49 11.4 perfticks
17 3.9 idlehands
9 2.1 isaconfig
9 2.1 microdelay
8cpus; for(i in 1 2 4 8 16){
>/dev/kpctl echo startclr;
>/dev/null time 8.burncycles $i;
>/dev/kpctl echo stop;
kprof /386/9pccpu /dev/kpdata|sed 10q
}
17.26u 0.00s 17.26r 8.burncycles 1
total: 265856 in kernel text: 248594 outside kernel text: 17262
KTZERO f0100000
ms % sym
191427 77.0 runproc
28607 11.5 _cycles
21218 8.5 perfticks
8618 3.4 isaconfig
4408 1.7 idlehands
3584 1.4 microdelay
1 0.0 nhgets
17.64u 0.00s 17.64r 8.burncycles 2
total: 276561 in kernel text: 241360 outside kernel text: 35201
KTZERO f0100000
ms % sym
181186 75.0 runproc
26816 11.1 _cycles
23267 9.6 perfticks
8049 3.3 isaconfig
4261 1.7 idlehands
3483 1.4 microdelay
2 0.0 sleep
18.87u 0.00s 18.87r 8.burncycles 4
total: 297021 in kernel text: 225113 outside kernel text: 71908
KTZERO f0100000
ms % sym
168136 74.6 runproc
24904 11.0 _cycles
22849 10.1 perfticks
7462 3.3 isaconfig
3879 1.7 idlehands
3058 1.3 microdelay
1 0.0 ilock
18.65u 0.00s 18.65r 8.burncycles 8
total: 289838 in kernel text: 148804 outside kernel text: 141034
KTZERO f0100000
ms % sym
117215 78.7 runproc
16729 11.2 _cycles
12872 8.6 perfticks
5119 3.4 isaconfig
2765 1.8 idlehands
2064 1.3 microdelay
2 0.0 sleep
19.34u 0.00s 19.35r 8.burncycles 16
total: 281308 in kernel text: -9895 outside kernel text: 291203
KTZERO f0100000
ms % sym
497 -5.0 runproc
78 0.-7 _cycles
50 0.-5 perfticks
14 0.-1 isaconfig
10 0.-1 microdelay
8 0.0 idlehands
1 0.0 ilock
- erik
[-- Attachment #2: burncycles.c --]
[-- Type: text/plain, Size: 787 bytes --]
#include <u.h>
#include <libc.h>
#include <thread.h>
#define Scale (100000000000ull)
/*
* waste time
*/
vlong
πjj(uint j)
{
vlong v;
v = 4*Scale / (2*j + 1);
if(j&1)
return -v;
return v;
}
vlong
π(void)
{
uint i;
vlong v;
v = 0;
for(i = 0; i < 500000000; i++)
v += πjj(i);
return v;
}
void
p(void *v)
{
int i;
i = (int)v;
print("%d: %lld\n", i, π());
threadexits("");
}
void
usage(void)
{
fprint(2, "usage: burncycles nthread\n");
threadexits("usage");
}
void
threadmain(int argc, char **argv)
{
int n, i;
ARGBEGIN{
default:
usage();
}ARGEND
n = 1;
else if(argc > 1 || (n = atoi(argv[0])) <= 0)
usage();
for(i = 0; i < n-1; i++)
proccreate(p, (void*)i, 4096);
p((void*)i);
}
* Re: [9fans] interesting timing tests
2010-06-20 21:55 ` erik quanstrom
@ 2010-06-21 1:41 ` erik quanstrom
2010-06-21 3:46 ` Venkatesh Srinivas
0 siblings, 1 reply; 18+ messages in thread
From: erik quanstrom @ 2010-06-21 1:41 UTC (permalink / raw)
To: 9fans
oops. botched fix of a harmless warning.
corrected source attached.
just for a giggle, i ran this test on a few handy machines to get a feel
for relative speed of a single core. since this test is small enough to
fit in the tiniest cache, i would think that memory speed or any other
external factor would be unimportant:
openrd/marvell kirkwood 471.97u 0.00s 472.25r
Intel(R) Atom(TM) CPU 330 @ 1.60GHz 48.47u 0.00s 48.48r
Intel(R) Pentium(R) 4 CPU 3.00GHz 40.72u 0.00s 40.76r
AMD Athlon(tm) 64 X2 Dual Core Processor 3800+ 30.62u 0.00s 30.64r
AMD Athlon(tm) 64 X2 Dual Core Processor 5000+ 23.18u 0.00s 23.19r
Intel(R) Xeon(R) CPU 5120 @ 1.86GHz 23.16u 0.00s 23.08r
Intel(R) Xeon(R) CPU E5540 @ 2.53GHz 17.26u 0.00s 17.26r
Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz 16.86u 0.00s 16.86r
Intel(R) Xeon(R) CPU E5630 @ 2.53GHz 10.46u 0.00s 10.50r
perhaps the arm's vlong arithmetic isn't as well optimized as it is
on x86. the atom is also conspicuously slow, but unfortunately with
no obvious excuse.
- erik
[-- Attachment #2: burncycles.c --]
[-- Type: text/plain, Size: 814 bytes --]
#include <u.h>
#include <libc.h>
#include <thread.h>
#define Scale (100000000000ull)
/*
* waste time
*/
vlong
πjj(uint j)
{
vlong v;
v = 4*Scale / (2*j + 1);
if(j&1)
return -v;
return v;
}
vlong
π(void)
{
uint i;
vlong v;
v = 0;
for(i = 0; i < 500000000; i++)
v += πjj(i);
return v;
}
void
p(void *v)
{
int i;
i = (int)v;
print("%d: %lld\n", i, π());
threadexits("");
}
void
usage(void)
{
fprint(2, "usage: burncycles nthread\n");
threadexits("usage");
}
void
threadmain(int argc, char **argv)
{
int n, i;
ARGBEGIN{
default:
usage();
}ARGEND
n = 0;
if(argc == 0)
n = 1;
else if(argc != 1 || (n = atoi(argv[0])) <= 0)
usage();
for(i = 0; i < n-1; i++)
proccreate(p, (void*)i, 4096);
p((void*)i);
}
* Re: [9fans] interesting timing tests
2010-06-21 1:41 ` erik quanstrom
@ 2010-06-21 3:46 ` Venkatesh Srinivas
2010-06-21 14:40 ` erik quanstrom
0 siblings, 1 reply; 18+ messages in thread
From: Venkatesh Srinivas @ 2010-06-21 3:46 UTC (permalink / raw)
To: Fans of the OS Plan 9 from Bell Labs
knowing which locks would rock.
i imagine the easiest way to find out would be to modify lock() to bump a
per-lock counter on failure-to-acquire. on i386, lock add would be the easiest
way to do that, i think. add an 'inited' field to the spinlock and a list
linkage as well, to allow easy examination when you hit the system with
acid.
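a sketch of that instrumentation (hypothetical names, not the real plan 9
lock code; gcc builtins stand in for the i386 lock-prefixed instructions):

```c
/* hypothetical instrumented spinlock -- not the real plan 9 code.
 * each lock counts failed acquisition attempts with an atomic add
 * (LOCK ADD on i386), and initialized locks are chained on a list
 * so acid can walk them later. */
typedef struct Ilock Ilock;
struct Ilock {
	volatile int	key;		/* 0 = free, 1 = held */
	unsigned long	misses;		/* failed acquire attempts */
	int		inited;
	Ilock		*link;		/* chain of live locks */
};

static Ilock *locklist;

void
ilockinit(Ilock *l)
{
	l->key = 0;
	l->misses = 0;
	l->inited = 1;
	l->link = locklist;		/* assumes single-threaded init */
	locklist = l;
}

void
lock(Ilock *l)
{
	while(__sync_lock_test_and_set(&l->key, 1) != 0){
		__sync_fetch_and_add(&l->misses, 1);	/* contended: count it */
		while(l->key != 0)
			;				/* spin on plain reads */
	}
}

void
unlock(Ilock *l)
{
	__sync_lock_release(&l->key);
}
```

walking locklist in acid (or just sorting by misses) would then show
which locks are actually contended.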
also if the locks in question need to be locked and the resources they
protect cannot be split, we can do much better than our current spinlocks:
void lock(int *l) {
int old = __sync_fetch_and_add(l, 1);
short next,owner;
do {
next = old & 0x0000FFFF;
owner = (old >> 16) & 0x0000FFFF;
old = *l;
} while(next != owner);
}
void unlock(int *l) {
__sync_fetch_and_add(l, (1 << 16));
}
(this is in gcc-C, but porting wouldn't be bad; the unlock
__sync_fetch_and_add would be LOCK ADD on i386. the __sync_fetch_and_add in
lock would be LOCK XADD on i386. i don't know 8a's syntax well enough to do
this right, in particular how 8a's pseudoregs work).
(many credits to nick piggin for this lock design. its totally rad.)
-- vs
* Re: [9fans] interesting timing tests
2010-06-21 3:46 ` Venkatesh Srinivas
@ 2010-06-21 14:40 ` erik quanstrom
2010-06-21 16:42 ` Venkatesh Srinivas
2010-06-21 16:43 ` erik quanstrom
0 siblings, 2 replies; 18+ messages in thread
From: erik quanstrom @ 2010-06-21 14:40 UTC (permalink / raw)
To: 9fans
void
lock(ulong *l)
{
ulong old;
ushort next, owner;
old = _xadd(l, 1);
for(;;){
next = old;
owner = old>>16;
old = *l;
if(next == owner)
break;
}
}
void
unlock(ulong *l)
{
_xadd(l, 1<<16);
}
- erik
* Re: [9fans] interesting timing tests
2010-06-21 14:40 ` erik quanstrom
@ 2010-06-21 16:42 ` Venkatesh Srinivas
2010-06-21 16:43 ` erik quanstrom
1 sibling, 0 replies; 18+ messages in thread
From: Venkatesh Srinivas @ 2010-06-21 16:42 UTC (permalink / raw)
To: Fans of the OS Plan 9 from Bell Labs
On Mon, Jun 21, 2010 at 10:40 AM, erik quanstrom <quanstro@quanstro.net>wrote:
> void
> lock(ulong *l)
> {
> ulong old;
> ushort next, owner;
>
> old = _xadd(l, 1);
> for(;;){
> next = old;
> owner = old>>16;
> old = *l;
> if(next == owner)
> break;
> }
> }
>
> void
> unlock(ulong *l)
> {
> _xadd(l, 1<<16);
> }
Sure, that's reasonable in C (i wasn't sure how to do it in asm for 8a;
that was what I was asking about). Just also remember to provide xadd; the
distribution 8a and 8l didn't support it last I checked.
Another observation: we can skip the reload of old in the uncontended
case if we swap the order of old = *l and the compare/break in lock.
Anyway, thoughts on this lock?
-- vs
* Re: [9fans] interesting timing tests
2010-06-21 14:40 ` erik quanstrom
2010-06-21 16:42 ` Venkatesh Srinivas
@ 2010-06-21 16:43 ` erik quanstrom
1 sibling, 0 replies; 18+ messages in thread
From: erik quanstrom @ 2010-06-21 16:43 UTC (permalink / raw)
To: 9fans
On Mon Jun 21 10:51:30 EDT 2010, quanstro@quanstro.net wrote:
> void
> lock(ulong *l)
somehow lost was an observation that since lock
is only testing that next == owner, and that both
are based on the current state of *l, i don't see how
this is robust in the face of more than one mach
spinning. who wins? am i missing something?
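for what it's worth, the usual fix (a sketch, not code from this thread;
a gcc builtin stands in for _xadd) is to latch the ticket once from the
xadd return value and re-read only the owner half. every waiter then
holds a distinct ticket, so exactly one of them can match the owner at
a time, no matter how many machs spin:

```c
/* sketch of a corrected ticket lock: low 16 bits = next ticket,
 * high 16 bits = current owner.  the xadd return value fixes this
 * caller's ticket; only the owner half is re-read while spinning. */
void
tlock(volatile unsigned int *l)
{
	unsigned int next, owner;

	next = __sync_fetch_and_add(l, 1) & 0xffff;	/* my ticket, latched once */
	for(;;){
		owner = (*l >> 16) & 0xffff;
		if(next == owner)
			break;				/* my turn */
	}
}

void
tunlock(volatile unsigned int *l)
{
	__sync_fetch_and_add(l, 1 << 16);		/* hand off to next ticket */
}
```

(ticket wraparound past 0xffff would carry into the owner half; a real
version widens the fields or keeps them in separate words.)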
also lost was the assembly which should be (from
memory) something like
TEXT _xadd(SB), 1, $0
MOVL l+0(FP), BX
MOVL n+4(FP), AX
LOCK; XADD AX, 0(BX)
RET
unfortunately that is not accepted by the assembler,
and (hopefully) equivalent BYTE statements were
rejected by the linker. perhaps someone knows immediately
how to sneak XADD in; i haven't yet investigated.
- erik
* Re: [9fans] interesting timing tests
2010-06-18 23:26 [9fans] interesting timing tests erik quanstrom
2010-06-19 13:42 ` Richard Miller
@ 2010-06-21 21:11 ` Bakul Shah
2010-06-21 21:21 ` erik quanstrom
2010-06-22 3:24 ` Lawrence E. Bakst
2 siblings, 1 reply; 18+ messages in thread
From: Bakul Shah @ 2010-06-21 21:11 UTC (permalink / raw)
To: Fans of the OS Plan 9 from Bell Labs
On Fri, 18 Jun 2010 19:26:25 EDT erik quanstrom <quanstro@labs.coraid.com> wrote:
> note the extreme system time on the 16 processor machine
Could this be due to memory contention caused by spinlocks?
While locks are spinning they eat up memory bandwidth which
slows down everyone's memory accesses (including the one who
is trying to finish its work while holding the spinlock).
And the more processors contend, the worse it gets....
How well does plan9 lock() scale with the number of processors?
Since this is analogous to accessing a CSMA network, one can
use a similar algorithm to ameliorate the bandwidth problem:
if you didn't get the lock, assume it will be a little while
before you can get it so you might as well backoff. There is
an old paper that talks about this.
Or it could simply be due to caching behaviour, if everyone
is accessing/mutating the same pages at the same time.
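A sketch of that backoff idea (hypothetical code; GCC builtins rather
than Plan 9's tas): spin on plain reads so waiting CPUs stay off the
bus, and double the delay after each failed probe, much like CSMA's
exponential backoff:

```c
enum { Maxdelay = 1<<10 };	/* cap on the backoff; a tuning guess */

static void
delayloop(int n)
{
	volatile int i;

	for(i = 0; i < n; i++)
		;		/* burn time without touching shared memory */
}

void
backlock(volatile int *l)
{
	int d;

	d = 1;
	while(__sync_lock_test_and_set(l, 1) != 0){
		do
			delayloop(d);	/* back off... */
		while(*l != 0);		/* ...then spin on a cached read */
		if(d < Maxdelay)
			d <<= 1;	/* wait longer each round, as in CSMA */
	}
}

void
backunlock(volatile int *l)
{
	__sync_lock_release(l);
}
```

Only the test-and-set itself generates coherence traffic; the read-only
spin hits this CPU's cache until the holder's unlock invalidates it.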
* Re: [9fans] interesting timing tests
2010-06-21 21:11 ` Bakul Shah
@ 2010-06-21 21:21 ` erik quanstrom
2010-06-21 21:47 ` Bakul Shah
0 siblings, 1 reply; 18+ messages in thread
From: erik quanstrom @ 2010-06-21 21:21 UTC (permalink / raw)
To: 9fans
> > note the extreme system time on the 16 processor machine
>
> Could this be due to memory contention caused by spinlocks?
> While locks are spinning they eat up memory bandwidth which
> slows down everyone's memory accesses (including the one who
> is trying to finish its work while holding the spinlock).
> And the more processors contend, the worse it gets....
perhaps.
> How well does plan9 lock() scale with the number of processors?
i think the question is, are there any spin locks that can become
unreasonablly contended as conf.nmach goes up. if so, i would
think that rather than finding the optimal solution to pessimal
use of spinlocks, we should look to optimize our use of spinlocks.
the underlying assumption is that the contended case is rare.
if this is not the case, then spin locks are not a good choice.
- erik
* Re: [9fans] interesting timing tests
2010-06-21 21:21 ` erik quanstrom
@ 2010-06-21 21:47 ` Bakul Shah
2010-06-21 22:16 ` erik quanstrom
0 siblings, 1 reply; 18+ messages in thread
From: Bakul Shah @ 2010-06-21 21:47 UTC (permalink / raw)
To: Fans of the OS Plan 9 from Bell Labs
On Mon, 21 Jun 2010 17:21:36 EDT erik quanstrom <quanstro@quanstro.net> wrote:
> > > note the extreme system time on the 16 processor machine
> >
> > Could this be due to memory contention caused by spinlocks?
> > While locks are spinning they eat up memory bandwidth which
> > slows down everyone's memory accesses (including the one who
> > is trying to finish its work while holding the spinlock).
> > And the more processors contend, the worse it gets....
>
> perhaps.
Is there a way to check this?
Is there a way to completely shut off N processors and
measure how the benchmark slows down as a function of processor count?
> > How well does plan9 lock() scale with the number of processors?
>
> i think the question is, are there any spin locks that can become
> unreasonably contended as conf.nmach goes up. if so, i would
> think that rather than finding the optimal solution to pessimal
> use of spinlocks, we should look to optimize our use of spinlocks.
I mentioned this as something to check, but I wouldn't be
surprised if the problem is a combination of factors. So first
you have to find out whether this is the problem in your case
before worrying about it.
> the underlying assumption is that the contended case is rare.
> if this is not the case, then spin locks are not a good choice.
With 8 dual HT processors the probability has gone up quite a
bit!
And what will you replace spinlocks with? The underlying
issue is contention due to sharing. If you can reduce sharing
you can reduce contention. Backoff alg. seems promising
because it can reduce memory access where it matters most.
For tens & tens of processors or more, message passing is the
only way but that would be a major redesign!
* Re: [9fans] interesting timing tests
2010-06-21 21:47 ` Bakul Shah
@ 2010-06-21 22:16 ` erik quanstrom
0 siblings, 0 replies; 18+ messages in thread
From: erik quanstrom @ 2010-06-21 22:16 UTC (permalink / raw)
To: 9fans
> Is there a way to check this?
>
> Is there a way to completely shut off N processors and
> measure benchmark speed slow down as function of processor?
there hasn't been any performance impact measured.
however, the extreme system time still seems weird.
richard miller suggested that kprof might be suffering from sampling
error, and moving rebalance to the end of the file confirmed this. both
the 4- and 16-processor machines show similar behavior.
> > the underlying assumption is that the contended case is rare.
> > if this is not the case, then spin locks are not a good choice.
>
> With 8 dual HT processors the probability has gone up quite a
> bit!
only if contention depends on processing speed. (locks may be
interrupted if they're not ilocks.)
for example, a lock protecting packet rx for an ethernet
driver would not depend strongly on the number of processors or
processing speed.
> And what will you replace spinlocks with?
the right answer here is likely "mu". some spinlocks
might need to be replaced with another structure.
but i would think that would depend entirely on the
situation. in general, i am not suggesting
deprecating spinlocks.
- erik
* Re: [9fans] interesting timing tests
2010-06-18 23:26 [9fans] interesting timing tests erik quanstrom
2010-06-19 13:42 ` Richard Miller
2010-06-21 21:11 ` Bakul Shah
@ 2010-06-22 3:24 ` Lawrence E. Bakst
2010-06-23 1:09 ` erik quanstrom
2 siblings, 1 reply; 18+ messages in thread
From: Lawrence E. Bakst @ 2010-06-22 3:24 UTC (permalink / raw)
To: Fans of the OS Plan 9 from Bell Labs
Do you have a way to turn off one of the sockets on "c" (2 x E5540) and get the numbers with HT (8 processors) and without HT (4 processors)? It would also be interesting to see "c" with HT turned off.
Certainly it seems to me that idlehands needs to be fixed, your bit array "active.schedwait" is one way.
In my experience bringing up the Alliant FX/8 mini-supercomputer, which had 8 (mostly CPU) + 12 (mostly I/O) = 20 processors, there were a bunch of details that had to be addressed as we went from 1 to 20 processors. There were even some issues with the system timing (user, sys, real) itself being messed up, but I can't remember the details.
I do remember one customer with a billing system who complained that their own customers, in high-I/O environments, were being charged for interrupts (included in sys time back then) that they didn't incur, which was true. I think we fixed that one by giving each CPU an idle process and charging each interrupt to that processor's idle process.
I mention it because we got a lot of mileage out of the decision to give every processor an idle process. Our scheduler was set up to only run that process if there were no other processes available for that processor. When the idle process did run it did a few things and then called halt. There is some more to the story and if anyone is interested, let me know and I'll either post a follow up or I can respond in private.
We used to have a saying at Alliant: "Data drives out speculation".
leb
At 7:26 PM -0400 6/18/10, erik quanstrom wrote:
>note the extreme system time on the 16 processor machine
>
>a 2 * Intel(R) Xeon(R) CPU 5120 @ 1.86GHz
>b 4 * Intel(R) Xeon(R) CPU E5630 @ 2.53GHz
>c 16* Intel(R) Xeon(R) CPU E5540 @ 2.53GHz
--
leb@iridescent.org
* Re: [9fans] interesting timing tests
2010-06-22 3:24 ` Lawrence E. Bakst
@ 2010-06-23 1:09 ` erik quanstrom
0 siblings, 0 replies; 18+ messages in thread
From: erik quanstrom @ 2010-06-23 1:09 UTC (permalink / raw)
To: 9fans
> Do you have a way to turn off one of the sockets on "c" (2 x E5540) and get the numbers with HT (8 processors) and without HT (4 processors)? It would also be interesting to see "c" with HT turned off.
here's the progression
nmach	user	sys	real	%ilock
4	4.41u	1.83s	4.06r	0.
8	4.47u	2.37s	3.60r	2.0
12	4.49u	8.34s	4.40r	11.0
16	4.36u	13.16s	4.43r	14.7
here's a fun little calculation:
16 threads * 4.43 s * 0.147 + 1.83s baseline
= 10.41936 + 1.83 thread*s
= 12.25s
it seems that increased ilock contention is a big factor
in the increase in system time.
ilock accounting shows that most (>80%) of the long-held ilocks
(>8.5µs, ~21k cycles) start at /sys/src/libc/port/pool.c:1318.
this is no surprise. technically, a long-held ilock is not
really a problem—until somebody else wants it. but we
can be fairly certain that allocb/malloc is a fairly contended code
path.
hopefully i'll be able to test a less-contended replacement for
allocb/freeb before i run out of time with this machine.
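one common shape for such a replacement (invented names; a sketch, not
the code i'm actually testing) is a per-processor free list with the
shared pool only as fallback, so the hot allocb/freeb path takes no
shared ilock:

```c
#include <stdlib.h>

enum { Nmach = 16, Bsize = 1500 };

typedef struct Block Block;
struct Block {
	Block	*next;
	char	buf[Bsize];
};

/* one free list per processor; in a kernel each list would be
 * touched only with interrupts off on its own cpu, so the common
 * path needs no shared lock at all. */
static Block *freelist[Nmach];

Block*
pcallocb(int machno)
{
	Block *b;

	b = freelist[machno];
	if(b != NULL){
		freelist[machno] = b->next;	/* fast path: cpu-local pop */
		return b;
	}
	return malloc(sizeof(Block));		/* slow path: shared allocator */
}

void
pcfreeb(int machno, Block *b)
{
	b->next = freelist[machno];		/* push back on this cpu's list */
	freelist[machno] = b;
}
```

freeing on the cpu that allocated keeps the lists balanced; a real
version would also bound each list and drain the excess back to the
shared pool.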
> Certainly it seems to me that idlehands needs to be fixed,
> your bit array "active.schedwait" is one way.
i'm not convinced that idlehands is anything but a power-waster.
performance-wise, it's nearly ideal.
- erik