* [9front] Speeding up snoopy(8)
@ 2022-10-18 4:57 droyo
2022-10-20 15:55 ` David Arroyo
0 siblings, 1 reply; 10+ messages in thread
From: droyo @ 2022-10-18 4:57 UTC (permalink / raw)
To: 9front
I am a plan 9 novice, especially when it comes to performance, so
thank you in advance for your patience.
I have been trying to diagnose some interesting behavior in 9front's
TCP, where it appears to be sending a handful of TCP segments
out-of-order every second or so at gigabit rates, and this is a bit
of a detour from that.
I set up a 9front VM on qemu with 2 cores and 2GB of memory. I have
not tried to tune the VM parameters at all, and it's very possible
that I'd get different results on real hardware.
I ran a trivial TCP benchmark from the 9front VM:
aux/listen1 tcp!*!9999 cat /dev/zero
And connected to the VM from its host. This easily achieves more than
gigabit speeds. The client and server are using sub-interfaces on the
same physical NIC, so latency is very low.
However, when I attempted to capture traffic on the 9front side with
ramfs; snoopy -D -f 'tcp(sd=9999)' /net/ether0 >/tmp/pcap
I found that about 50% of the packets were not recorded (I knew this
because I compared it with a capture I took at the client). With help
from IRC, I could see soft overflows incrementing in the
/net/ether0/$index/stats file. I patched the stats command so I could
watch it during my benchmarks.
While cinap offered some alternative filtering options like ipmux, I
would like to try to improve snoopy's performance. I tried profiling
snoopy with
pid=`{ psu | awk '/snoopy/ { print $2 }' }
echo profile > /proc/$pid/ctl
while() { tprof $pid ; sleep 5 }
Here is a sample during the benchmark, with snoopy's output redirected
to /dev/null to rule out disk or ramfs bottlenecks:
total: 1150
TEXT 00200000
ms % sym
360 31.3 pread
230 20.0 pwrite
120 10.4 nsec
110 9.5 _filterpkt
70 6.0 tracepkt
60 5.2 p_filter
60 5.2 defaultframer
50 4.3 p_filter
40 3.4 be2vlong
20 1.7 read
10 0.8 write
10 0.8 newfilter
10 0.8 main
You can see a full 50% of time is spent in pread/pwrite. I patched
snoopy to dial /net/ether0!-2 instead of /net/ether0!-1, which
truncates the frames to the first 64 bytes, and the overflows went
away.
The first 64 bytes are generally the most useful bytes, so an easy
win would be to use the "-2" dial option when snoopy's "-M" flag is
<=64.
More generally, though, I think snoopy could be much faster if it
read multiple frames per pread() and wrote multiple records per
pwrite(). I am a bit surprised to see pwrite() so high even when
I am redirecting to /dev/null.
It's trivial enough to buffer snoopy's output with bio(2). But I
don't know of an existing interface that would let me read more than
one frame at a time. Is there one? The only other option I can think
of to reduce the effect of pread() would be to move the pread to
a separate thread. But I'm curious to hear your thoughts.
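For what it's worth, here is a rough sketch of the separate-thread
idea (hypothetical, not snoopy code: a single-producer/single-consumer
ring between the pread loop and the pwrite loop, written with portable
C11 atomics only so it is self-contained; on Plan 9 this would be done
with thread(2) and channels, and the slot/frame sizes are made up):

```c
#include <stdatomic.h>
#include <string.h>

/* Hypothetical SPSC ring between a reader thread (pread loop) and a
 * writer thread (pwrite loop).  Zero-initialize a Ring before use. */
enum { Nslot = 8, Slotsz = 64 };

typedef struct Ring Ring;
struct Ring {
	unsigned char slot[Nslot][Slotsz];
	int len[Nslot];
	atomic_int head;	/* advanced only by the reader */
	atomic_int tail;	/* advanced only by the writer */
};

/* reader: queue one frame; returns 0 (drop) when the ring is full */
int
ringput(Ring *r, unsigned char *buf, int n)
{
	int h = atomic_load(&r->head);

	if((h+1) % Nslot == atomic_load(&r->tail))
		return 0;
	memmove(r->slot[h], buf, n);
	r->len[h] = n;
	atomic_store(&r->head, (h+1) % Nslot);
	return 1;
}

/* writer: dequeue one frame; returns -1 when the ring is empty */
int
ringget(Ring *r, unsigned char *buf)
{
	int t = atomic_load(&r->tail);
	int n;

	if(t == atomic_load(&r->head))
		return -1;
	n = r->len[t];
	memmove(buf, r->slot[t], n);
	atomic_store(&r->tail, (t+1) % Nslot);
	return n;
}
```

The point is only that the reader never has to block on output; a full
ring becomes an explicit drop counter instead of a kernel-side soft
overflow.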
Thanks!
David
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [9front] Speeding up snoopy(8)
2022-10-18 4:57 [9front] Speeding up snoopy(8) droyo
@ 2022-10-20 15:55 ` David Arroyo
2022-10-20 16:03 ` Sigrid Solveig Haflínudóttir
0 siblings, 1 reply; 10+ messages in thread
From: David Arroyo @ 2022-10-20 15:55 UTC (permalink / raw)
To: 9front
On Tue, Oct 18, 2022, at 00:57, droyo@aqwari.net wrote:
> total: 1150
> TEXT 00200000
> ms % sym
> 360 31.3 pread
> 230 20.0 pwrite
> 120 10.4 nsec
> 110 9.5 _filterpkt
> 70 6.0 tracepkt
> 60 5.2 p_filter
> 60 5.2 defaultframer
> 50 4.3 p_filter
> 40 3.4 be2vlong
> 20 1.7 read
> 10 0.8 write
> 10 0.8 newfilter
> 10 0.8 main
> ... snip ...
> More generally, though, I think snoopy could be much faster if it
> read multiple frames per pread() and wrote multiple records per
> pwrite().
I thought about this some more, and I realize that this approach is
premature at best and plain wrong at worst. Snoopy should spend as
much time as possible in pread, because that is when it is
depleting the queue. I should be trying to reduce the time spent
in everything *except* read, and only then should I be worrying
about the per-packet pread overhead.
Reading multiple packets at a time would, however, reduce the number
of nsec() calls.
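As a toy illustration of that effect (names and the stand-in clock are
hypothetical; the counter just makes the savings visible): stamping a
batch of n frames with one clock sample turns n clock reads into one,
at the cost of every frame in the batch sharing a timestamp.

```c
#include <stdint.h>

/* Stand-in clock source; clockcalls counts how often it is hit. */
static int clockcalls;

static int64_t
sampleclock(void)
{
	clockcalls++;
	return 0;	/* a real clock would return nsec */
}

/* one clock read per frame: n calls for n frames */
void
stampeach(int64_t *ts, int n)
{
	int i;

	for(i = 0; i < n; i++)
		ts[i] = sampleclock();
}

/* one clock read per batch: a single call for n frames */
void
stampbatch(int64_t *ts, int n)
{
	int i;
	int64_t t = sampleclock();

	for(i = 0; i < n; i++)
		ts[i] = t;
}
```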
I will experiment with putting the pwrite in a separate thread, and
look for small performance improvements in the other functions. I
am not sure why nsec() time is so high, but I think that's a function
that could perform very differently on real vs virtual hardware, so
I'm hesitant to try to improve it until I get 9front running on a
real machine.
David
* Re: [9front] Speeding up snoopy(8)
2022-10-20 15:55 ` David Arroyo
@ 2022-10-20 16:03 ` Sigrid Solveig Haflínudóttir
2022-10-21 15:38 ` David Arroyo
0 siblings, 1 reply; 10+ messages in thread
From: Sigrid Solveig Haflínudóttir @ 2022-10-20 16:03 UTC (permalink / raw)
To: 9front
Quoth David Arroyo <droyo@aqwari.net>:
> I will experiment with putting the pwrite in a separate thread, and
> look for small performance improvements in the other functions. I
> am not sure why nsec() time is so high, but I think that's a function
> that could perform very differently on real vs virtual hardware, so
> I'm hesitant to try to improve it until I get 9front running on a
> real machine.
nsec() is a function call that results in a syscall. It is also affected
by time adjustments, to my knowledge. At some point I replaced the
nsec() call in vmx with a (somewhat dirty) nanosec(), see
/sys/src/cmd/vmx/nanosec.c
It would be nice to have a way to get monotonic timestamps in userspace
without having to go into the kernel. For now, you can try nanosec and
see if anything changes.
* Re: [9front] Speeding up snoopy(8)
2022-10-20 16:03 ` Sigrid Solveig Haflínudóttir
@ 2022-10-21 15:38 ` David Arroyo
2022-10-21 16:04 ` ori
2022-10-21 17:40 ` Lyndon Nerenberg (VE7TFX/VE6BBM)
0 siblings, 2 replies; 10+ messages in thread
From: David Arroyo @ 2022-10-21 15:38 UTC (permalink / raw)
To: 9front
On Thu, Oct 20, 2022, at 12:03, Sigrid Solveig Haflínudóttir wrote:
> It would be nice to have a way to get monotonic timestamps in userspace
> without having to go into the kernel. For now, you can try nanosec and
> see if anything changes.
I tried using nanosec, and that was sufficient for snoopy to keep up
with a bit rate of 1.6 Gbps, with no overflows. Thank you for the tip!
David
* Re: [9front] Speeding up snoopy(8)
2022-10-21 15:38 ` David Arroyo
@ 2022-10-21 16:04 ` ori
2022-10-21 16:54 ` David Arroyo
2022-10-21 17:40 ` Lyndon Nerenberg (VE7TFX/VE6BBM)
1 sibling, 1 reply; 10+ messages in thread
From: ori @ 2022-10-21 16:04 UTC (permalink / raw)
To: 9front
Quoth David Arroyo <droyo@aqwari.net>:
> On Thu, Oct 20, 2022, at 12:03, Sigrid Solveig Haflínudóttir wrote:
> > It would be nice to have a way to get monotonic timestamps in userspace
> > without having to go into the kernel. For now, you can try nanosec and
> > see if anything changes.
>
> I tried using nanosec, and that was sufficient for snoopy to keep up
> with a bit rate of 1.6 Gbps, with no overflows. Thank you for the tip!
>
> David
...should we replace the nsec() function with it?
we may be able to stash nsec (and monotonic nsec?) at last
context switch into tos, and interpolate using cycles.
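roughly, the interpolation would be: nsec = stashed nsec +
(cycles now - cycles at switch) * 1000000000 / cyclefreq. a standalone
sketch of just that arithmetic (field names illustrative, not the
actual Tos layout):

```c
#include <stdint.h>

/* Illustrative stand-in for the values stashed at context switch. */
typedef struct Timebase Timebase;
struct Timebase {
	int64_t	basens;		/* nsec stashed at last context switch */
	int64_t	basecycles;	/* cycle counter at last context switch */
	int64_t	cyclefreq;	/* cycles per second */
};

/* extrapolate from the stashed pair using the current cycle count */
int64_t
interpns(Timebase *tb, int64_t nowcycles)
{
	return tb->basens + (nowcycles - tb->basecycles)*1000000000LL / tb->cyclefreq;
}
```

(multiplying before dividing keeps nanosecond resolution, but overflows
once a few seconds of cycles accumulate between switches; a real
version would need to split the division.)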
* Re: [9front] Speeding up snoopy(8)
2022-10-21 16:04 ` ori
@ 2022-10-21 16:54 ` David Arroyo
2022-10-21 17:10 ` ori
0 siblings, 1 reply; 10+ messages in thread
From: David Arroyo @ 2022-10-21 16:54 UTC (permalink / raw)
To: 9front
On Fri, Oct 21, 2022, at 12:04, ori@eigenstate.org wrote:
> Quoth David Arroyo <droyo@aqwari.net>:
>> On Thu, Oct 20, 2022, at 12:03, Sigrid Solveig Haflínudóttir wrote:
>> > It would be nice to have a way to get monotonic timestamps in userspace
>> > without having to go into the kernel. For now, you can try nanosec and
>> > see if anything changes.
>>
>> I tried using nanosec, and that was sufficient for snoopy to keep up
>> with a bit rate of 1.6 Gbps, with no overflows. Thank you for the tip!
>
> ...should we replace the nsec() function with it?
>
> we may be able to stash nsec (and monotonic nsec?) at last
> context switch into tos, and interpolate using cycles.
I don't fully understand the tradeoffs in using nanosec() over nsec().
For packet captures, I specifically *don't* want time adjustments during
a capture, so nanosec() being monotonic is a plus.
It doesn't matter for snoopy currently, but is nanosec() thread-safe?
It's keeping xstart in a static variable; would different threads have
different xstarts, or could they step on each other if they call
nanosec() at the same time?
David
* Re: [9front] Speeding up snoopy(8)
2022-10-21 16:54 ` David Arroyo
@ 2022-10-21 17:10 ` ori
0 siblings, 0 replies; 10+ messages in thread
From: ori @ 2022-10-21 17:10 UTC (permalink / raw)
To: 9front
Quoth David Arroyo <droyo@aqwari.net>:
> On Fri, Oct 21, 2022, at 12:04, ori@eigenstate.org wrote:
> > Quoth David Arroyo <droyo@aqwari.net>:
> >> On Thu, Oct 20, 2022, at 12:03, Sigrid Solveig Haflínudóttir wrote:
> >> > It would be nice to have a way to get monotonic timestamps in userspace
> >> > without having to go into the kernel. For now, you can try nanosec and
> >> > see if anything changes.
> >>
> >> I tried using nanosec, and that was sufficient for snoopy to keep up
> >> with a bit rate of 1.6 Gbps, with no overflows. Thank you for the tip!
> >
> > ...should we replace the nsec() function with it?
> >
> > we may be able to stash nsec (and monotonic nsec?) at last
> > context switch into tos, and interpolate using cycles.
>
> I don't fully understand the tradeoffs in using nanosec() over nsec().
> For packet captures, I specifically *don't* want time adjustments during
> a capture, so nanosec() being monotonic is a plus.
>
> It doesn't matter for snoopy currently, but is nanosec() thread-safe?
> It's keeping xstart in a static variable; would different threads have
> different xstarts, or could they step on each other if they call
> nanosec() at the same time?
>
> David
it is not; as it stands, it's not a drop-in replacement for nsec();
doing it properly would mean extending the Tos struct, I think.
* Re: [9front] Speeding up snoopy(8)
2022-10-21 15:38 ` David Arroyo
2022-10-21 16:04 ` ori
@ 2022-10-21 17:40 ` Lyndon Nerenberg (VE7TFX/VE6BBM)
2022-10-21 18:52 ` ori
1 sibling, 1 reply; 10+ messages in thread
From: Lyndon Nerenberg (VE7TFX/VE6BBM) @ 2022-10-21 17:40 UTC (permalink / raw)
To: 9front, David Arroyo
This is coming completely out of left field, but ...
On machines that support atomic 64-bit read/write, would
it be possible to map a single hardwired non-cacheable
page into every process's address space and store the
tick counter there?
--lyndon
* Re: [9front] Speeding up snoopy(8)
2022-10-21 17:40 ` Lyndon Nerenberg (VE7TFX/VE6BBM)
@ 2022-10-21 18:52 ` ori
2022-10-21 19:23 ` ori
0 siblings, 1 reply; 10+ messages in thread
From: ori @ 2022-10-21 18:52 UTC (permalink / raw)
To: 9front, droyo
Quoth Lyndon Nerenberg (VE7TFX/VE6BBM) <lyndon@orthanc.ca>:
> This is coming completely out of left field, but ...
>
> On machines that support atomic 64-bit read/write, would
> it be possible to map a single hardwired non-cacheable
> page into every process's address space and store the
> tick counter there?
>
> --lyndon
I'm imagining something like (completely untested, just a sketch,
and almost certainly missing some places where the counters need
to be updated):
diff dfee08d50df674cd76f74320bc9c8bc6a4a95f1e uncommitted
--- a//sys/include/tos.h
+++ b//sys/include/tos.h
@@ -17,7 +17,10 @@
 	vlong	kcycles;	/* cycles spent in kernel */
 	vlong	pcycles;	/* cycles spent in process (kernel + user) */
 	ulong	pid;		/* might as well put the pid here */
-	ulong	clock;
+	ulong	clock;		/* profiling clock */
+	vlong	cycles;		/* cycles at last context switch */
+	vlong	tmono;		/* monotonic nsecs since arbitrary start time */
+	vlong	tadj;		/* timesync-adjusted nsecs since epoch */
 	/* top of stack is here */
 };
--- a//sys/src/libc/9sys/nsec.c
+++ b//sys/src/libc/9sys/nsec.c
@@ -24,7 +24,25 @@
 	return fd >= 0 && fd2path(fd, buf, sizeof(buf)) == 0 && strcmp(buf, name) == 0;
 }
+static vlong
+nsecfast(void)
+{
+	uvlong t;
+
+	cycles(&t);
+	return _tos->tadj + (t - _tos->cycles)*1000000000LL/_tos->cyclefreq;
+}
+
 vlong
+nsecmono(void)
+{
+	uvlong t;
+
+	cycles(&t);
+	return _tos->tmono + (t - _tos->cycles)*1000000000LL/_tos->cyclefreq;
+}
+
+vlong
 nsec(void)
 {
 	static char name[] = "/dev/bintime";
@@ -33,6 +51,8 @@
 	vlong t;
 	int f;
+	if(_tos->cyclefreq != 0)
+		return nsecfast();
 	if(pidp != nil && *pidp == _tos->pid)
 		f = *fdp;
 	else{
--- a/sys/src/9/port/proc.c
+++ b/sys/src/9/port/proc.c
@@ -132,6 +132,9 @@
 	tos->kcycles += t - up->kentry;
 	tos->pcycles = t + up->pcycles;
 	tos->pid = up->pid;
+	tos->cycles = t;
+	tos->tmono = fastticks2ns(fastticks(nil));
+	tos->tadj = todget();
 }
static void