From: erik quanstrom
Date: Sun, 9 Jan 2011 12:06:21 -0500
To: 9fans@9fans.net
Message-ID: <16094d5a594bfa72dd0e9ac6f3f8b31c@plug.quanstro.net>
Subject: [9fans] fs performance

the new auth server, which uses the fs as its root rather than a
stand-alone fs, happens to be faster than our now-old cpu server,
so i did a quick build test with a kernel including the massive-fw
myricom driver.  suspecting that latency kills even on 10gbe, i
tried a second build with NPROC=24.

a table comparing ken's fs, fossil+venti, and ramfs follows.
unfortunately, i was not able to use the same machine for the
fossil+venti tests, so there's a ramfs run on each machine to put
the numbers in perspective, given the large differences in
processor generation, network, &c.

here's an example test:

	tyty; echo $NPROC
	4
	tyty; time mk >/dev/null && mk clean >/dev/null
	2.93u 1.30s 3.36r	mk
	tyty; NPROC=24 time mk >/dev/null && mk clean >/dev/null
	1.32u 0.22s 2.29r	mk

and here are the compiled results:

a	Intel(R) Xeon(R) CPU X5550 @ 2.67GHz
	4 active cores (8 threads; 4 enabled); http://ark.intel.com/Product.aspx?id=35365
	intel 82598 10gbe nic; fs has myricom 10gbe nic; 54µs latency

b	Intel(R) Core(TM)2 Quad CPU Q9400 @ 2.66GHz
	4 active cores (4 threads; 4 enabled); http://www.intel.com/p/en_US/products/server/processor/xeon5000/specifications
	intel 82563-style gbe nic; 70µs latency

mach	fs	nproc	time
a	ken	4	2.93u 1.30s 3.36r	mk
		24	1.32u 0.22s 2.29r	mk
	ramfs	4	3.10u 1.67s 3.01r	mk
		24	2.98u 1.23s 2.42r	mk
b	venti	4	2.65u 3.44s 21.46r	mk
		24	2.98u 3.56s 21.58r	mk
	ramfs	4	3.55u 2.22s 9.08r	mk
		24	3.50u 2.67s 9.41r	mk

it's interesting that neither venti nor ramfs gets any faster on
machine b with NPROC set to 24, but both get faster on machine a,
and the fastest time of all is not ramfs but ken's fs with NPROC=24.
so i suppose the 64-bit question is: is that because moving data
in and out of user space is slower than 10gbe, or because ramfs is
single-threaded and slow?

in any event, it's clear that even if the fs is good, latency can
kill on a 10gbe lan.  (some back-of-envelope arithmetic in the
p.s. below.)

it would naively seem to me that using the Tstream model would be
too expensive, requiring thousands of new streams, and would
require modifying at least 8c, 8l, mk, rc, and awk (what am i
forgetting?).  but it would be worth a test.

- erik
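
p.s. for scale, a crude model of a sequential 9p read in which each
Tread waits out one full round trip before the next is sent.  the 8k
message size and the 1mb file are made-up round numbers, not
measurements; only the 54µs rtt is the number quoted for machine a
above:

	#include <stdio.h>

	/* crude model: sequential 9p reads, one full round trip
	   per Tread.  msize and file size are made-up round
	   numbers; the 54µs rtt is machine a's measured latency. */
	int
	main(void)
	{
		double rtt = 54e-6;		/* round trip, seconds */
		double msize = 8192;		/* bytes per Tread/Rread */
		double wire = 10e9/8;		/* 10gbe payload rate, bytes/sec */
		double file = 1024*1024;	/* one 1mb file */
		double trips = file/msize;

		printf("round trips:  %.0f\n", trips);
		printf("latency time: %.2f ms\n", 1e3*trips*rtt);
		printf("wire time:    %.2f ms\n", 1e3*file/wire);
		return 0;
	}

that's ~6.9ms of waiting against ~0.84ms of wire time, so the link
sits idle roughly 90% of the read.  the model ignores that each 8k
reply's transfer overlaps the round trip, but it shows the shape of
the problem: bandwidth isn't what we're short of.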
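
p.p.s. the same arithmetic hints at what streaming (or just keeping
several reads in flight) would buy: with k requests outstanding, the
round-trip term divides by k while the wire time stays fixed.  k
here is an arbitrary illustration, not anything 9p offers today:

	#include <stdio.h>

	/* same model, but with k Treads kept in flight: the
	   round-trip term divides by k, the wire time doesn't.
	   k is an arbitrary illustration, not a 9p feature. */
	int
	main(void)
	{
		double rtt = 54e-6, msize = 8192, wire = 10e9/8;
		double file = 1024*1024;
		double trips = file/msize;
		int k;

		for(k = 1; k <= 16; k *= 2)
			printf("k=%2d: %.2f ms\n",
				k, 1e3*(trips*rtt/k + file/wire));
		return 0;
	}

by k=8 the round-trip term is down to about the wire time, which is
roughly the win a Tstream-style change would be chasing.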