I am a plan 9 novice, especially when it comes to performance, so thank
you in advance for your patience.

I have been trying to diagnose some interesting behavior in 9front's
TCP where it appears to be sending a handful of TCP segments
out-of-order every 1 second or so at gigabit rates, and this is a bit
of a detour from that.

I set up a 9front VM on qemu with 2 cores and 2GB of memory. I have not
tried to tune the VM parameters at all, and it's very possible that I'd
get different results on real hardware. I ran a trivial TCP benchmark
from the 9front VM:

	aux/listen1 tcp!*!9999 cat /dev/zero

and connected to the VM from its host. This easily achieves more than
gigabit speeds. The client and server are using sub-interfaces on the
same physical NIC, so latency is very low. However, when I attempted
to capture traffic on the 9front side with

	ramfs; snoopy -D -f 'tcp(sd=9999)' /net/ether0 >/tmp/pcap

I found that about 50% of the packets were not recorded (I knew this
because I compared it with a capture I took at the client). With help
from IRC, I could see soft overflows incrementing in the
/net/ether0/$index/stats file. I patched the stats command so I could
watch it during my benchmarks.

While cinap offered some alternative filtering options like ipmux, I
would like to try to improve snoopy's performance. I tried profiling
snoopy with

	pid=`{ psu | awk '/snoopy/ { print $2 }' }
	echo profile > /proc/$pid/ctl
	while() { tprof $pid ; sleep 5 }

Here is a sample taken during the benchmark, with snoopy's output
redirected to /dev/null to rule out disk or ramfs bottlenecks:

	total: 1150
	TEXT 00200000
	ms	%	sym
	360	31.3	pread
	230	20.0	pwrite
	120	10.4	nsec
	110	9.5	_filterpkt
	70	6.0	tracepkt
	60	5.2	p_filter
	60	5.2	defaultframer
	50	4.3	p_filter
	40	3.4	be2vlong
	20	1.7	read
	10	0.8	write
	10	0.8	newfilter
	10	0.8	main

You can see a full 50% of time is spent in pread/pwrite.
I patched snoopy to dial /net/ether0!-2 instead of /net/ether0!-1,
which truncates the frames to the first 64 bytes, and the overflows
went away. The first 64 bytes are generally the most useful ones, so
an easy win would be to use the "-2" dial option whenever snoopy's
"-M" flag is <= 64.

More generally, though, I think snoopy could be much faster if it read
multiple frames per pread() and wrote multiple records per pwrite().
I am a bit surprised to see pwrite() so high even when I am
redirecting to /dev/null. It's trivial enough to buffer snoopy's
output with bio(2), but I don't know of an existing interface that
would let me read more than one frame at a time. Is there one?

The only other option I can think of to reduce the effect of pread()
would be to move the pread to a separate thread. I'm curious to hear
your thoughts.

Thanks,
David
On Tue, Oct 18, 2022, at 00:57, droyo@aqwari.net wrote:
> total: 1150
> TEXT 00200000
> ms	%	sym
> 360	31.3	pread
> 230	20.0	pwrite
> 120	10.4	nsec
> 110	9.5	_filterpkt
> 70	6.0	tracepkt
> 60	5.2	p_filter
> 60	5.2	defaultframer
> 50	4.3	p_filter
> 40	3.4	be2vlong
> 20	1.7	read
> 10	0.8	write
> 10	0.8	newfilter
> 10	0.8	main
> ... snip ...
> More generally, though, I think snoopy could be much faster if it
> read multiple frames per pread() and wrote multiple records per
> pwrite().
I thought about this some more, and I realize that this approach is
premature at best and plain wrong at worst. Snoopy should spend as
much time as possible in pread, because that is when it is
depleting the queue. I should be trying to reduce the time spent
in everything *except* read, and only then should I be worrying
about the per-packet pread overhead.
Reading multiple packets at a time would, however, reduce the number
of nsec() calls.
I will experiment with putting the pwrite in a separate thread, and
look for small performance improvements in the other functions. I
am not sure why nsec() time is so high, but I think that's a function
that could perform very differently on real vs virtual hardware, so
I'm hesitant to try to improve it until I get 9front running on a
real machine.
David
Quoth David Arroyo <droyo@aqwari.net>:
> I will experiment with putting the pwrite in a separate thread, and
> look for small performance improvements in the other functions. I
> am not sure why nsec() time is so high, but I think that's a function
> that could perform very differently on real vs virtual hardware, so
> I'm hesitant to try to improve it until I get 9front running on a
> real machine.
nsec() is a function that results in a syscall. It is also affected
by time adjustments, to my knowledge. At some point I replaced the
nsec() call with a (somewhat dirty) nanosec() in vmx; see
/sys/src/cmd/vmx/nanosec.c
It would be nice to have a way to get monotonic timestamps in userspace
without having to go into the kernel. For now, you can try nanosec and
see if anything changes.
On Thu, Oct 20, 2022, at 12:03, Sigrid Solveig Haflínudóttir wrote:
> It would be nice to have a way to get monotonic timestamps in userspace
> without having to go into the kernel. For now, you can try nanosec and
> see if anything changes.
I tried using nanosec, and that was sufficient for snoopy to keep up
with a bit rate of 1.6 Gbps, with no overflows. Thank you for the tip!
David
Quoth David Arroyo <droyo@aqwari.net>:
> On Thu, Oct 20, 2022, at 12:03, Sigrid Solveig Haflínudóttir wrote:
> > It would be nice to have a way to get monotonic timestamps in userspace
> > without having to go into the kernel. For now, you can try nanosec and
> > see if anything changes.
>
> I tried using nanosec, and that was sufficient for snoopy to keep up
> with a bit rate of 1.6 Gbps, with no overflows. Thank you for the tip!
>
> David
...should we replace the nsec() function with it?
we may be able to stash nsec (and monotonic nsec?) at last
context switch into tos, and interpolate using cycles.
On Fri, Oct 21, 2022, at 12:04, ori@eigenstate.org wrote:
> Quoth David Arroyo <droyo@aqwari.net>:
>> On Thu, Oct 20, 2022, at 12:03, Sigrid Solveig Haflínudóttir wrote:
>> > It would be nice to have a way to get monotonic timestamps in userspace
>> > without having to go into the kernel. For now, you can try nanosec and
>> > see if anything changes.
>>
>> I tried using nanosec, and that was sufficient for snoopy to keep up
>> with a bit rate of 1.6 Gbps, with no overflows. Thank you for the tip!
>
> ...should we replace the nsec() function with it?
>
> we may be able to stash nsec (and monotonic nsec?) at last
> context switch into tos, and interpolate using cycles.
I don't fully understand the tradeoffs in using nanosec() over nsec().
For packet captures, I specifically *don't* want time adjustments during
a capture, so nanosec() being monotonic is a plus.
It doesn't matter for snoopy currently, but is nanosec() thread-safe?
It's keeping xstart in a static variable; would different threads have
different xstarts, or could they step on each other if they call
nanosec() at the same time?
David
Quoth David Arroyo <droyo@aqwari.net>:
> On Fri, Oct 21, 2022, at 12:04, ori@eigenstate.org wrote:
> > Quoth David Arroyo <droyo@aqwari.net>:
> >> On Thu, Oct 20, 2022, at 12:03, Sigrid Solveig Haflínudóttir wrote:
> >> > It would be nice to have a way to get monotonic timestamps in userspace
> >> > without having to go into the kernel. For now, you can try nanosec and
> >> > see if anything changes.
> >>
> >> I tried using nanosec, and that was sufficient for snoopy to keep up
> >> with a bit rate of 1.6 Gbps, with no overflows. Thank you for the tip!
> >
> > ...should we replace the nsec() function with it?
> >
> > we may be able to stash nsec (and monotonic nsec?) at last
> > context switch into tos, and interpolate using cycles.
>
> I don't fully understand the tradeoffs in using nanosec() over nsec().
> For packet captures, I specifically *don't* want time adjustments during
> a capture, so nanosec() being monotonic is a plus.
>
> It doesn't matter for snoopy currently, but is nanosec() thread-safe?
> It's keeping xstart in a static variable; would different threads have
> different xstarts, or could they step on each other if they call
> nanosec() at the same time?
>
> David
it is not; as it stands, it's not a drop-in replacement for nsec();
doing it properly would mean extending the Tos struct, I think.
This is coming completely out of left field, but ...

On machines that support atomic 64-bit read/write, would it be
possible to map a single hardwired non-cacheable page into every
process's address space and store the tick counter there?

--lyndon
Quoth Lyndon Nerenberg (VE7TFX/VE6BBM) <lyndon@orthanc.ca>:
> This is coming completely out of left field, but ...
>
> On machines that support atomic 64-bit read/write, would
> it be possible to map a single hardwired non-cacheable
> page into every process's address space and store the
> tick counter there?
>
> --lyndon
I'm imagining something like (completely untested, just a sketch,
and almost certainly missing some places where the counters need
to be updated):
diff dfee08d50df674cd76f74320bc9c8bc6a4a95f1e uncommitted
--- a//sys/include/tos.h
+++ b//sys/include/tos.h
@@ -17,7 +17,10 @@
 	vlong	kcycles;	/* cycles spent in kernel */
 	vlong	pcycles;	/* cycles spent in process (kernel + user) */
 	ulong	pid;		/* might as well put the pid here */
-	ulong	clock;
+	ulong	clock;		/* profiling clock */
+	vlong	cycles;		/* cycles at last context switch */
+	vlong	tmono;		/* monotonic nsecs since arbitrary start time */
+	vlong	tadj;		/* timesync-adjusted nsecs since epoch */
 	/* top of stack is here */
 };
--- a//sys/src/libc/9sys/nsec.c
+++ b//sys/src/libc/9sys/nsec.c
@@ -24,7 +24,25 @@
 	return fd >= 0 && fd2path(fd, buf, sizeof(buf)) == 0 && strcmp(buf, name) == 0;
 }
 
+static vlong
+nsecfast()
+{
+	uvlong t;
+
+	cycles(&t);
+	return tos->tadj + (tos->cycles/tos->cyclefreq);
+}
+
 vlong
+nsecmono()
+{
+	uvlong t;
+
+	cycles(&t);
+	return tos->tmono + (tos->cycles/tos->cyclefreq);
+}
+
+vlong
 nsec(void)
 {
 	static char name[] = "/dev/bintime";
@@ -33,6 +51,8 @@
 	vlong t;
 	int f;
 
+	if(tos->cyclefreq != 0)
+		return nsecfast();
 	if(pidp != nil && *pidp == _tos->pid)
 		f = *fdp;
 	else{
--- a/sys/src/9/port/proc.c
+++ b/sys/src/9/port/proc.c
@@ -132,6 +132,8 @@
 	tos->kcycles += t - up->kentry;
 	tos->pcycles = t + up->pcycles;
 	tos->pid = up->pid;
+	tos->tmono = fastticks2ns(fastticks(nil));
+	tos->tadj = todget();
 }
 
 static void
Quoth ori@eigenstate.org:
> +	return tos->tadj + (tos->cycles/tos->cyclefreq);
...obviously, this is very wrong, given cyclefreq is in hz.