* [9front] Speeding up snoopy(8)
@ 2022-10-18 4:57 droyo
2022-10-20 15:55 ` David Arroyo
0 siblings, 1 reply; 10+ messages in thread
From: droyo @ 2022-10-18 4:57 UTC (permalink / raw)
To: 9front
I am a plan 9 novice, especially when it comes to performance, so
thank you in advance for your patience.
I have been trying to diagnose some interesting behavior in 9front's
TCP, where it appears to be sending a handful of TCP segments
out-of-order every second or so at gigabit rates, and this is a bit
of a detour from that.
I set up a 9front VM on qemu with 2 cores and 2GB of memory. I have
not tried to tune the VM parameters at all, and it's very possible
that I'd get different results on real hardware.
I ran a trivial TCP benchmark from the 9front VM:
aux/listen1 tcp!*!9999 cat /dev/zero
And connected to the VM from its host. This easily achieves more than
gigabit speeds. The client and server are using sub-interfaces on the
same physical NIC, so latency is very low.
However, when I attempted to capture traffic on the 9front side with
ramfs; snoopy -D -f 'tcp(sd=9999)' /net/ether0 >/tmp/pcap
I found that about 50% of the packets were not recorded (I knew this
because I compared it with a capture I took at the client). With help
from IRC, I could see soft overflows incrementing in the
/net/ether0/$index/stats file. I patched the stats command so I could
watch it during my benchmarks.
While cinap offered some alternative filtering options like ipmux, I
would like to try to improve snoopy's performance. I tried profiling
snoopy with
pid=`{ psu | awk '/snoopy/ { print $2 }' }
echo profile > /proc/$pid/ctl
while() { tprof $pid ; sleep 5 }
Here is a sample during the benchmark, with snoopy's output redirected
to /dev/null to rule out disk or ramfs bottlenecks:
total: 1150
TEXT 00200000
ms % sym
360 31.3 pread
230 20.0 pwrite
120 10.4 nsec
110 9.5 _filterpkt
70 6.0 tracepkt
60 5.2 p_filter
60 5.2 defaultframer
50 4.3 p_filter
40 3.4 be2vlong
20 1.7 read
10 0.8 write
10 0.8 newfilter
10 0.8 main
You can see a full 50% of time is spent in pread/pwrite. I patched
snoopy to dial /net/ether0!-2 instead of /net/ether0!-1, which
truncates the frames to the first 64 bytes, and the overflows went
away.
The first 64 bytes are generally the most useful bytes, so an easy
win would be to use the "-2" dial option when snoopy's "-M" flag is
<=64.
More generally, though, I think snoopy could be much faster if it
read multiple frames per pread() and wrote multiple records per
pwrite(). I am a bit surprised to see pwrite() so high even when
I am redirecting to /dev/null.
It's trivial enough to buffer snoopy's output with bio(2). But I
don't know of an existing interface that would let me read more than
one frame at a time. Is there one? The only other option I can think
of to reduce the effect of pread() would be to move the pread to
a separate thread. But I'm curious to hear your thoughts.
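For what it's worth, here is a rough sketch of the separate-thread
idea (hypothetical, not snoopy code: a single-producer/single-consumer
ring between the pread loop and the pwrite loop, written with portable
C11 atomics only so it is self-contained; on Plan 9 this would be done
with thread(2) and channels, and the slot/frame sizes are made up):

```c
#include <stdatomic.h>
#include <string.h>

/* Hypothetical SPSC ring between a reader thread (pread loop) and a
 * writer thread (pwrite loop).  Zero-initialize a Ring before use. */
enum { Nslot = 8, Slotsz = 64 };

typedef struct Ring Ring;
struct Ring {
	unsigned char slot[Nslot][Slotsz];
	int len[Nslot];
	atomic_int head;	/* advanced only by the reader */
	atomic_int tail;	/* advanced only by the writer */
};

/* reader: queue one frame; returns 0 (drop) when the ring is full */
int
ringput(Ring *r, unsigned char *buf, int n)
{
	int h = atomic_load(&r->head);

	if((h+1) % Nslot == atomic_load(&r->tail))
		return 0;
	memmove(r->slot[h], buf, n);
	r->len[h] = n;
	atomic_store(&r->head, (h+1) % Nslot);
	return 1;
}

/* writer: dequeue one frame; returns -1 when the ring is empty */
int
ringget(Ring *r, unsigned char *buf)
{
	int t = atomic_load(&r->tail);
	int n;

	if(t == atomic_load(&r->head))
		return -1;
	n = r->len[t];
	memmove(buf, r->slot[t], n);
	atomic_store(&r->tail, (t+1) % Nslot);
	return n;
}
```

The point is only that the reader never has to block on output; a full
ring becomes an explicit drop counter instead of a kernel-side soft
overflow.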
Thanks!
David
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [9front] Speeding up snoopy(8)
2022-10-18 4:57 [9front] Speeding up snoopy(8) droyo
@ 2022-10-20 15:55 ` David Arroyo
2022-10-20 16:03 ` Sigrid Solveig Haflínudóttir
0 siblings, 1 reply; 10+ messages in thread
From: David Arroyo @ 2022-10-20 15:55 UTC (permalink / raw)
To: 9front
On Tue, Oct 18, 2022, at 00:57, droyo@aqwari.net wrote:
> total: 1150
> TEXT 00200000
> ms % sym
> 360 31.3 pread
> 230 20.0 pwrite
> 120 10.4 nsec
> 110 9.5 _filterpkt
> 70 6.0 tracepkt
> 60 5.2 p_filter
> 60 5.2 defaultframer
> 50 4.3 p_filter
> 40 3.4 be2vlong
> 20 1.7 read
> 10 0.8 write
> 10 0.8 newfilter
> 10 0.8 main
> ... snip ...
> More generally, though, I think snoopy could be much faster if it
> read multiple frames per pread() and wrote multiple records per
> pwrite().
I thought about this some more, and I realize that this approach is
premature at best and plain wrong at worst. Snoopy should spend as
much time as possible in pread, because that is when it is
depleting the queue. I should be trying to reduce the time spent
in everything *except* read, and only then should I be worrying
about the per-packet pread overhead.
Reading multiple packets at a time would, however, reduce the number
of nsec() calls.
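As a toy illustration of that effect (names and the stand-in clock are
hypothetical; the counter just makes the savings visible): stamping a
batch of n frames with one clock sample turns n clock reads into one,
at the cost of every frame in the batch sharing a timestamp.

```c
#include <stdint.h>

/* Stand-in clock source; clockcalls counts how often it is hit. */
static int clockcalls;

static int64_t
sampleclock(void)
{
	clockcalls++;
	return 0;	/* a real clock would return nsec */
}

/* one clock read per frame: n calls for n frames */
void
stampeach(int64_t *ts, int n)
{
	int i;

	for(i = 0; i < n; i++)
		ts[i] = sampleclock();
}

/* one clock read per batch: a single call for n frames */
void
stampbatch(int64_t *ts, int n)
{
	int i;
	int64_t t = sampleclock();

	for(i = 0; i < n; i++)
		ts[i] = t;
}
```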
I will experiment with putting the pwrite in a separate thread, and
look for small performance improvements in the other functions. I
am not sure why nsec() time is so high, but I think that's a function
that could perform very differently on real vs virtual hardware, so
I'm hesitant to try to improve it until I get 9front running on a
real machine.
David
* Re: [9front] Speeding up snoopy(8)
2022-10-20 15:55 ` David Arroyo
@ 2022-10-20 16:03 ` Sigrid Solveig Haflínudóttir
2022-10-21 15:38 ` David Arroyo
0 siblings, 1 reply; 10+ messages in thread
From: Sigrid Solveig Haflínudóttir @ 2022-10-20 16:03 UTC (permalink / raw)
To: 9front
Quoth David Arroyo <droyo@aqwari.net>:
> I will experiment with putting the pwrite in a separate thread, and
> look for small performance improvements in the other functions. I
> am not sure why nsec() time is so high, but I think that's a function
> that could perform very differently on real vs virtual hardware, so
> I'm hesitant to try to improve it until I get 9front running on a
> real machine.
nsec() is a function call that results in a syscall. It is also affected
by time adjustments, to my knowledge. At some point I replaced the
nsec() call in vmx with a (somewhat dirty) nanosec(), see
/sys/src/cmd/vmx/nanosec.c
It would be nice to have a way to get monotonic timestamps in userspace
without having to go into the kernel. For now, you can try nanosec and
see if anything changes.
* Re: [9front] Speeding up snoopy(8)
2022-10-20 16:03 ` Sigrid Solveig Haflínudóttir
@ 2022-10-21 15:38 ` David Arroyo
2022-10-21 16:04 ` ori
2022-10-21 17:40 ` Lyndon Nerenberg (VE7TFX/VE6BBM)
0 siblings, 2 replies; 10+ messages in thread
From: David Arroyo @ 2022-10-21 15:38 UTC (permalink / raw)
To: 9front
On Thu, Oct 20, 2022, at 12:03, Sigrid Solveig Haflínudóttir wrote:
> It would be nice to have a way to get monotonic timestamps in userspace
> without having to go into the kernel. For now, you can try nanosec and
> see if anything changes.
I tried using nanosec, and that was sufficient for snoopy to keep up
with a bit rate of 1.6 Gbps, with no overflows. Thank you for the tip!
David
* Re: [9front] Speeding up snoopy(8)
2022-10-21 15:38 ` David Arroyo
@ 2022-10-21 16:04 ` ori
2022-10-21 16:54 ` David Arroyo
2022-10-21 17:40 ` Lyndon Nerenberg (VE7TFX/VE6BBM)
1 sibling, 1 reply; 10+ messages in thread
From: ori @ 2022-10-21 16:04 UTC (permalink / raw)
To: 9front
Quoth David Arroyo <droyo@aqwari.net>:
> On Thu, Oct 20, 2022, at 12:03, Sigrid Solveig Haflínudóttir wrote:
> > It would be nice to have a way to get monotonic timestamps in userspace
> > without having to go into the kernel. For now, you can try nanosec and
> > see if anything changes.
>
> I tried using nanosec, and that was sufficient for snoopy to keep up
> with a bit rate of 1.6 Gbps, with no overflows. Thank you for the tip!
>
> David
...should we replace the nsec() function with it?
we may be able to stash nsec (and monotonic nsec?) at last
context switch into tos, and interpolate using cycles.
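roughly, the interpolation would be: nsec = stashed nsec +
(cycles now - cycles at switch) * 1000000000 / cyclefreq. a standalone
sketch of just that arithmetic (field names illustrative, not the
actual Tos layout):

```c
#include <stdint.h>

/* Illustrative stand-in for the values stashed at context switch. */
typedef struct Timebase Timebase;
struct Timebase {
	int64_t	basens;		/* nsec stashed at last context switch */
	int64_t	basecycles;	/* cycle counter at last context switch */
	int64_t	cyclefreq;	/* cycles per second */
};

/* extrapolate from the stashed pair using the current cycle count */
int64_t
interpns(Timebase *tb, int64_t nowcycles)
{
	return tb->basens + (nowcycles - tb->basecycles)*1000000000LL / tb->cyclefreq;
}
```

(multiplying before dividing keeps nanosecond resolution, but overflows
once a few seconds of cycles accumulate between switches; a real
version would need to split the division.)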
* Re: [9front] Speeding up snoopy(8)
2022-10-21 16:04 ` ori
@ 2022-10-21 16:54 ` David Arroyo
2022-10-21 17:10 ` ori
0 siblings, 1 reply; 10+ messages in thread
From: David Arroyo @ 2022-10-21 16:54 UTC (permalink / raw)
To: 9front
On Fri, Oct 21, 2022, at 12:04, ori@eigenstate.org wrote:
> Quoth David Arroyo <droyo@aqwari.net>:
>> On Thu, Oct 20, 2022, at 12:03, Sigrid Solveig Haflínudóttir wrote:
>> > It would be nice to have a way to get monotonic timestamps in userspace
>> > without having to go into the kernel. For now, you can try nanosec and
>> > see if anything changes.
>>
>> I tried using nanosec, and that was sufficient for snoopy to keep up
>> with a bit rate of 1.6 Gbps, with no overflows. Thank you for the tip!
>
> ...should we replace the nsec() function with it?
>
> we may be able to stash nsec (and monotonic nsec?) at last
> context switch into tos, and interpolate using cycles.
I don't fully understand the tradeoffs in using nanosec() over nsec().
For packet captures, I specifically *don't* want time adjustments during
a capture, so nanosec() being monotonic is a plus.
It doesn't matter for snoopy currently, but is nanosec() thread-safe?
It's keeping xstart in a static variable; would different threads have
different xstarts, or could they step on each other if they call
nanosec() at the same time?
David
* Re: [9front] Speeding up snoopy(8)
2022-10-21 16:54 ` David Arroyo
@ 2022-10-21 17:10 ` ori
0 siblings, 0 replies; 10+ messages in thread
From: ori @ 2022-10-21 17:10 UTC (permalink / raw)
To: 9front
Quoth David Arroyo <droyo@aqwari.net>:
> On Fri, Oct 21, 2022, at 12:04, ori@eigenstate.org wrote:
> > Quoth David Arroyo <droyo@aqwari.net>:
> >> On Thu, Oct 20, 2022, at 12:03, Sigrid Solveig Haflínudóttir wrote:
> >> > It would be nice to have a way to get monotonic timestamps in userspace
> >> > without having to go into the kernel. For now, you can try nanosec and
> >> > see if anything changes.
> >>
> >> I tried using nanosec, and that was sufficient for snoopy to keep up
> >> with a bit rate of 1.6 Gbps, with no overflows. Thank you for the tip!
> >
> > ...should we replace the nsec() function with it?
> >
> > we may be able to stash nsec (and monotonic nsec?) at last
> > context switch into tos, and interpolate using cycles.
>
> I don't fully understand the tradeoffs in using nanosec() over nsec().
> For packet captures, I specifically *don't* want time adjustments during
> a capture, so nanosec() being monotonic is a plus.
>
> It doesn't matter for snoopy currently, but is nanosec() thread-safe?
> It's keeping xstart in a static variable; would different threads have
> different xstarts, or could they step on each other if they call
> nanosec() at the same time?
>
> David
it is not; as it stands, it's not a drop-in replacement for nsec();
doing it properly would mean extending the Tos struct, I think.
* Re: [9front] Speeding up snoopy(8)
2022-10-21 15:38 ` David Arroyo
2022-10-21 16:04 ` ori
@ 2022-10-21 17:40 ` Lyndon Nerenberg (VE7TFX/VE6BBM)
2022-10-21 18:52 ` ori
1 sibling, 1 reply; 10+ messages in thread
From: Lyndon Nerenberg (VE7TFX/VE6BBM) @ 2022-10-21 17:40 UTC (permalink / raw)
To: 9front, David Arroyo
This is coming completely out of left field, but ...
On machines that support atomic 64-bit read/write, would
it be possible to map a single hardwired non-cacheable
page into every process's address space and store the
tick counter there?
--lyndon
* Re: [9front] Speeding up snoopy(8)
2022-10-21 17:40 ` Lyndon Nerenberg (VE7TFX/VE6BBM)
@ 2022-10-21 18:52 ` ori
2022-10-21 19:23 ` ori
0 siblings, 1 reply; 10+ messages in thread
From: ori @ 2022-10-21 18:52 UTC (permalink / raw)
To: 9front, droyo
Quoth Lyndon Nerenberg (VE7TFX/VE6BBM) <lyndon@orthanc.ca>:
> This is coming completely out of left field, but ...
>
> On machines that support atomic 64-bit read/write, would
> it be possible to map a single hardwired non-cacheable
> page into every process's address space and store the
> tick counter there?
>
> --lyndon
I'm imagining something like (completely untested, just a sketch,
and almost certainly missing some places where the counters need
to be updated):
diff dfee08d50df674cd76f74320bc9c8bc6a4a95f1e uncommitted
--- a//sys/include/tos.h
+++ b//sys/include/tos.h
@@ -17,7 +17,10 @@
 	vlong	kcycles;	/* cycles spent in kernel */
 	vlong	pcycles;	/* cycles spent in process (kernel + user) */
 	ulong	pid;		/* might as well put the pid here */
-	ulong	clock;
+	ulong	clock;		/* profiling clock */
+	vlong	cycles;		/* cycles at last context switch */
+	vlong	tmono;		/* monotonic nsecs since arbitrary start time */
+	vlong	tadj;		/* timesync-adjusted nsecs since epoch */
 	/* top of stack is here */
 };
--- a//sys/src/libc/9sys/nsec.c
+++ b//sys/src/libc/9sys/nsec.c
@@ -24,7 +24,25 @@
 	return fd >= 0 && fd2path(fd, buf, sizeof(buf)) == 0 && strcmp(buf, name) == 0;
 }
+static vlong
+nsecfast(void)
+{
+	uvlong t;
+
+	cycles(&t);
+	return _tos->tadj + (t - _tos->cycles)*1000000000LL/_tos->cyclefreq;
+}
+
 vlong
+nsecmono(void)
+{
+	uvlong t;
+
+	cycles(&t);
+	return _tos->tmono + (t - _tos->cycles)*1000000000LL/_tos->cyclefreq;
+}
+
+vlong
 nsec(void)
 {
 	static char name[] = "/dev/bintime";
@@ -33,6 +51,8 @@
 	vlong t;
 	int f;
+	if(_tos->cyclefreq != 0)
+		return nsecfast();
 	if(pidp != nil && *pidp == _tos->pid)
 		f = *fdp;
 	else{
--- a/sys/src/9/port/proc.c
+++ b/sys/src/9/port/proc.c
@@ -132,6 +132,9 @@
 	tos->kcycles += t - up->kentry;
 	tos->pcycles = t + up->pcycles;
 	tos->pid = up->pid;
+	tos->cycles = t;
+	tos->tmono = fastticks2ns(fastticks(nil));
+	tos->tadj = todget();
 }
static void