* Re: [9fans] etherigbe.c using _xinc?
  From: erik quanstrom @ 2009-12-09  0:32 UTC
  To: 9fans

> but the former does two operations and the latter
> only one.  your claim was that _xinc is slower
> than incref (== lock(), x++, unlock()).  but you are
> timing xinc+xdec against incref.

sure.  i was looking at it as a kernel version of a
semaphore.

back to the original problem: before, allocb/freeb did
2 lock/unlocks.  now it does 2 lock/unlocks plus an xinc
and an xdec, and is in the best case 31% slower and in the
worst case 90% slower.  reference counting is a heavy price
to pay on every network block when it is only used by
ip/gre.c.

- erik
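erik doesn't show the arithmetic, but one reading that reproduces his
percentages uses the timings he posts below, taking looplock as one
lock/unlock pair and loopxinc as one xinc+xdec pair:

	atom 330 (best case):	before = 2*22ns = 44ns;  after = 44 + 14 = 58ns;  58/44 - 1 ≈ 31%
	core i7 (worst case):	before = 2*11ns = 22ns;  after = 22 + 20 = 42ns;  42/22 - 1 ≈ 90%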
* Re: [9fans] etherigbe.c using _xinc?
  From: Russ Cox @ 2009-12-09  1:05 UTC
  To: Fans of the OS Plan 9 from Bell Labs

On Tue, Dec 8, 2009 at 4:32 PM, erik quanstrom <quanstro@quanstro.net> wrote:
>> but the former does two operations and the latter
>> only one.  your claim was that _xinc is slower
>> than incref (== lock(), x++, unlock()).  but you are
>> timing xinc+xdec against incref.
>
> sure.  i was looking at it as a kernel version of a
> semaphore.

no, your original claim was that incref/decref was faster
than _xinc/_xdec.  the numbers don't support that claim.

> reference counting is a heavy price to pay on every
> network block when it is only used by ip/gre.c.

has the network gotten fast enough that an extra bus
transaction per block slows it down?  it seems like gigabit
ethernet would be around 100k packets per second, so the
extra 50ns or so per packet would be 5ms per second in
practice, which is significant but hardly seems prohibitive.

> before, allocb/freeb did 2 lock/unlocks.  now it does
> 2 lock/unlocks plus an xinc and an xdec, and is in the
> best case 31% slower and in the worst case 90% slower.

i don't know how you get those numbers, but anything even
approaching that would mean the kernel is spending all its
time in igberballoc, at which point you probably have other
things to fix.

russ
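the 100k figure checks out for full-size (1500-byte) frames; working
the arithmetic:

	1 Gb/s:  10^9 bits/s / (1538 bytes-on-wire * 8 bits) ≈ 81k frames/s, i.e. roughly 100k
	100,000 packets/s * 50 ns/packet = 5,000,000 ns/s = 5 ms/s, about 0.5% of one core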
* Re: [9fans] etherigbe.c using _xinc?
  From: erik quanstrom @ 2009-12-09  2:04 UTC
  To: 9fans

> has the network gotten fast enough that an extra bus
> transaction per block slows it down?  it seems like gigabit
> ethernet would be around 100k packets per second, so the
> extra 50ns or so per packet would be 5ms per second in
> practice, which is significant but hardly seems prohibitive.

i'm working with 10gbe, and pcie 2.0 is making 2x10gbe
attractive.  so multiply by 10 or 20.  and if you're doing
request/response, multiply by 2 again.

- erik
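scaling the earlier estimate the same way (a rough sketch; exact rates
depend on frame size):

	10 Gb/s at 1500-byte frames:	10^10 / (1538 bytes-on-wire * 8 bits) ≈ 812k frames/s
	extra cost per port:		812,000 * 50 ns ≈ 41 ms/s
	2x10gbe, request/response:	41 ms/s * 2 * 2 ≈ 160 ms/s of one core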
* [9fans] etherigbe.c using _xinc?
  From: Venkatesh Srinivas @ 2009-12-08 16:25 UTC
  To: Fans of the OS Plan 9 from Bell Labs

Hi,

I noticed etherigbe.c (in igberballoc) was recently changed to
increment the refcount on the block it allocates.  Any reason it
uses _xinc rather than incref?

-- vs
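for readers without the source at hand, the pattern in question looks
roughly like this — a sketch, not the driver's exact code; the pool
and lock names are hypothetical, and Block/ilock come from the usual
kernel headers:

	/* sketch of a driver receive-buffer allocator; igberbpool and igberblock are made-up names */
	static Block*
	igberballoc(void)
	{
		Block *bp;

		ilock(&igberblock);
		if((bp = igberbpool) != nil){
			igberbpool = bp->next;
			bp->next = nil;
			_xinc(&bp->ref);	/* the change under discussion: atomic increment, not incref */
		}
		iunlock(&igberblock);
		return bp;
	}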
* Re: [9fans] etherigbe.c using _xinc?
  From: erik quanstrom @ 2009-12-08 16:36 UTC
  To: 9fans

On Tue Dec 8 11:28:30 EST 2009, me@acm.jhu.edu wrote:
> I noticed etherigbe.c (in igberballoc) was recently changed to
> increment the refcount on the block it allocates.  Any reason it
> uses _xinc rather than incref?

because it's not a Ref.  unfortunately, if it were a Ref, it
would be much faster.  _xinc is deadly slow on x86 even when
there is no contention.  i wish the ref counting had at least
been isolated to the case that needs it.

blocks in queues typically have one owner, so the owner of a
block assumes it can modify the block at any time with no
locking.  ref counting makes this assumption false.  i'm not
sure how you're supposed to wlock a block.

- erik
* Re: [9fans] etherigbe.c using _xinc?
  From: Russ Cox @ 2009-12-08 19:35 UTC
  To: Fans of the OS Plan 9 from Bell Labs

> because it's not a Ref.  unfortunately, if it were a Ref, it
> would be much faster.  _xinc is deadly slow on x86 even when
> there is no contention.

do you have numbers to back up this claim?

you are claiming that the locked XCHGL in tas (pc/l.s),
called from lock (port/taslock.c), called from incref
(port/chan.c), is "much faster" than the locked INCL in
_xinc (pc/l.s).  it seems to me that a locked memory bus
is a locked memory bus.

also, when up != nil (a common condition), lock does a
locked INCL and DECL (_xinc and _xdec) in addition to the
tas, which seems like strictly more work than a single _xinc.

russ
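for reference, a simplified sketch of the slower path being compared,
as it appears in the stock kernel sources (the real lock() also spins
and does accounting):

	/* port/chan.c (simplified): incref brackets the increment with a test-and-set lock */
	long
	incref(Ref *r)
	{
		long x;

		lock(r);	/* tas spin lock; per the message above, also touches up->nlocks when up != nil */
		x = ++r->ref;
		unlock(r);
		return x;
	}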
* Re: [9fans] etherigbe.c using _xinc?
  From: John Floren @ 2009-12-08 19:52 UTC
  To: Fans of the OS Plan 9 from Bell Labs

On Tue, Dec 8, 2009 at 2:35 PM, Russ Cox <rsc@swtch.com> wrote:
>> because it's not a Ref.  unfortunately, if it were a Ref, it
>> would be much faster.  _xinc is deadly slow on x86 even when
>> there is no contention.
>
> do you have numbers to back up this claim?

I don't have the code or the numbers in front of me, but I recall
seeing quite a bit of speed improvement when I experimentally
replaced incref/decref with direct calls to _xinc/_xdec.  I don't
remember what the test was, but I do remember getting something
like a 35% improvement on it.  I ran that kernel on my terminal
for the rest of the summer without trouble; while I didn't notice
a blazing speed increase, it didn't slow me down either.

John

--
"Object-oriented design is the roman numerals of computing" -- Rob Pike
* Re: [9fans] etherigbe.c using _xinc?
  From: erik quanstrom @ 2009-12-08 20:00 UTC
  To: 9fans

> do you have numbers to back up this claim?
>
> you are claiming that the locked XCHGL in tas (pc/l.s),
> called from lock (port/taslock.c), called from incref
> (port/chan.c), is "much faster" than the locked INCL in
> _xinc (pc/l.s).  it seems to me that a locked memory bus
> is a locked memory bus.

yes, i do.  xinc is a real loss on most modern intel parts
and a moderate loss on amd; my atom 330 is the exception.

intel core i7 2.4ghz
	loop		0 nsec/call
	loopxinc	20 nsec/call
	looplock	11 nsec/call

intel 5000 1.6ghz
	loop		0 nsec/call
	loopxinc	44 nsec/call
	looplock	25 nsec/call

intel atom 330 1.6ghz (exception!)
	loop		2 nsec/call
	loopxinc	14 nsec/call
	looplock	22 nsec/call

amd k10 2.0ghz
	loop		2 nsec/call
	loopxinc	30 nsec/call
	looplock	20 nsec/call

intel p4 xeon 3.0ghz
	loop		1 nsec/call
	loopxinc	76 nsec/call
	looplock	42 nsec/call

- erik

[-- Attachment #2: xinc.s --]

TEXT _xinc(SB), 1, $0		/* void _xinc(long*); */
	MOVL	l+0(FP), AX
	LOCK; INCL	0(AX)
	RET

TEXT _xdec(SB), 1, $0		/* long _xdec(long*); */
	MOVL	l+0(FP), BX
	XORL	AX, AX
	LOCK; DECL	0(BX)
	JLT	_xdeclt
	JGT	_xdecgt
	RET
_xdecgt:
	INCL	AX
	RET
_xdeclt:
	DECL	AX
	RET

[-- Attachment #3: timing.c --]

#include <u.h>
#include <libc.h>

void	_xinc(uint*);
void	_xdec(uint*);

enum {
	N	= 1<<30,
};

void
loop(void)
{
	uint i;

	for(i = 0; i < N; i++)
		;
}

void
loopxinc(void)
{
	uint i, x;

	for(i = 0; i < N; i++){
		_xinc(&x);
		_xdec(&x);
	}
}

void
looplock(void)
{
	uint i;
	static Lock l;

	for(i = 0; i < N; i++){
		lock(&l);
		unlock(&l);
	}
}

void
timing(char *s, void (*f)(void))
{
	uvlong t[2];

	t[0] = nsec();
	f();
	t[1] = nsec();
	fprint(2, "%s\t%llud nsec/call\n", s, (t[1] - t[0])/(uvlong)N);
}

void
main(void)
{
	nsec();		/* first call is expensive; do it once before timing */
	timing("loop", loop);
	timing("loopxinc", loopxinc);
	timing("looplock", looplock);
	exits("");
}
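to reproduce on a 386 Plan 9 machine, something like the following
should work — a sketch of the usual build steps, not taken from the
original mail:

	8a xinc.s			# assemble the atomic ops -> xinc.8
	8c timing.c			# compile the benchmark -> timing.8
	8l -o timing timing.8 xinc.8	# link
	timing				# writes loop/loopxinc/looplock timings to fd 2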
* Re: [9fans] etherigbe.c using _xinc?
  From: Russ Cox @ 2009-12-08 23:52 UTC
  To: Fans of the OS Plan 9 from Bell Labs

it looks like you are comparing these two functions:

	void
	loopxinc(void)
	{
		uint i, x;

		for(i = 0; i < N; i++){
			_xinc(&x);
			_xdec(&x);
		}
	}

	void
	looplock(void)
	{
		uint i;
		static Lock l;

		for(i = 0; i < N; i++){
			lock(&l);
			unlock(&l);
		}
	}

but the former does two operations per iteration (an increment
and a decrement) and the latter only one (a lock/unlock, i.e.
one incref).  your claim was that _xinc is slower than incref
(== lock(), x++, unlock()), but you are timing xinc+xdec
against incref.

assuming xinc and xdec cost approximately the same (so i can
just halve the numbers for loopxinc), the fair comparison
produces:

intel core i7 2.4ghz
	loop		0 nsec/call
	loopxinc	10 nsec/call	// was 20
	looplock	11 nsec/call

intel 5000 1.6ghz
	loop		0 nsec/call
	loopxinc	22 nsec/call	// was 44
	looplock	25 nsec/call

intel atom 330 1.6ghz (exception!)
	loop		2 nsec/call
	loopxinc	7 nsec/call	// was 14
	looplock	22 nsec/call

amd k10 2.0ghz
	loop		2 nsec/call
	loopxinc	15 nsec/call	// was 30
	looplock	20 nsec/call

intel p4 xeon 3.0ghz
	loop		1 nsec/call
	loopxinc	38 nsec/call	// was 76
	looplock	42 nsec/call

which looks like a much different story.

russ