i tried it myself:

% for(i in 1 2 3 4){
	time fcp sun.tgz /dev/null
	time cp sun.tgz /dev/null
	time hget http://plan9.bell-labs.com/magic/9down4e/compressed/1108754619.nm555mqv7uc7rvvyye52p4zcaeeziq2d/sun.tgz  > /dev/null
}
0.00u 0.01s 12.09r 	 fcp sun.tgz /dev/null
0.00u 0.03s 30.37r 	 cp sun.tgz /dev/null
0.03u 0.11s 11.93r 	 hget http://plan9.bell-labs.com/magic/9down4e/compressed/1108754619.nm555mqv7uc7rvvyye52p4zcaeeziq2d/sun.tgz
0.00u 0.04s 12.16r 	 fcp sun.tgz /dev/null
0.00u 0.00s 30.32r 	 cp sun.tgz /dev/null
0.01u 0.06s 10.16r 	 hget http://plan9.bell-labs.com/magic/9down4e/compressed/1108754619.nm555mqv7uc7rvvyye52p4zcaeeziq2d/sun.tgz
0.00u 0.04s 12.46r 	 fcp sun.tgz /dev/null
0.00u 0.01s 30.24r 	 cp sun.tgz /dev/null
0.08u 0.02s 9.71r 	 hget http://plan9.bell-labs.com/magic/9down4e/compressed/1108754619.nm555mqv7uc7rvvyye52p4zcaeeziq2d/sun.tgz
0.00u 0.01s 11.86r 	 fcp sun.tgz /dev/null
0.00u 0.03s 30.10r 	 cp sun.tgz /dev/null
0.05u 0.07s 9.93r 	 hget http://plan9.bell-labs.com/magic/9down4e/compressed/1108754619.nm555mqv7uc7rvvyye52p4zcaeeziq2d/sun.tgz

fcp's overhead relative to hget was averaging about 15% there
(12.14s against 10.43s per run); plain cp took about 30s every time.
it seems it isn't nearly as bad as i remember, which is good!

BTW, there's a bug in fcp; you need to malloc the
buffer separately inside each thread, otherwise
you get data corruption.
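
To illustrate the fix, here is a minimal sketch of a parallel copy in which each worker mallocs its own buffer. This is not the fcp source: it assumes POSIX threads with pread/pwrite rather than Plan 9 processes, and the names Job, worker, parallel_copy and BLOCKSIZE are hypothetical.

```c
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>

enum { BLOCKSIZE = 8192 };	/* hypothetical block size */

/* Hypothetical per-worker job: copy bytes [start, end) from src to dst. */
typedef struct {
	int src, dst;
	off_t start, end;
	int err;
} Job;

static void *
worker(void *arg)
{
	Job *j = arg;
	/* Each worker allocates its own buffer. With one buffer shared by
	 * all workers, one worker's read could overwrite data that another
	 * worker has read but not yet written out. */
	char *buf = malloc(BLOCKSIZE);
	if(buf == NULL){
		j->err = 1;
		return NULL;
	}
	for(off_t off = j->start; off < j->end && !j->err; ){
		size_t want = BLOCKSIZE;
		if((off_t)want > j->end - off)
			want = (size_t)(j->end - off);
		ssize_t n = pread(j->src, buf, want, off);
		if(n <= 0){		/* error or unexpected end of file */
			j->err = 1;
			break;
		}
		/* pwrite may write fewer bytes than asked, so loop */
		for(ssize_t done = 0; done < n; ){
			ssize_t w = pwrite(j->dst, buf + done, n - done, off + done);
			if(w <= 0){
				j->err = 1;
				break;
			}
			done += w;
		}
		off += n;
	}
	free(buf);
	return NULL;
}

/* Copy size bytes from src to dst with nworkers threads; 0 on success. */
int
parallel_copy(int src, int dst, off_t size, int nworkers)
{
	if(nworkers < 1 || size < 0)
		return -1;
	pthread_t *tid = malloc(nworkers * sizeof *tid);
	Job *job = malloc(nworkers * sizeof *job);
	if(tid == NULL || job == NULL){
		free(tid);
		free(job);
		return -1;
	}
	off_t chunk = (size + nworkers - 1) / nworkers;
	int started = 0, err = 0;
	for(int i = 0; i < nworkers; i++){
		off_t start = (off_t)i * chunk;
		if(start > size)
			start = size;
		off_t end = start + chunk;
		if(end > size)
			end = size;
		job[i] = (Job){ src, dst, start, end, 0 };
		if(pthread_create(&tid[i], NULL, worker, &job[i]) != 0){
			err = 1;
			break;
		}
		started++;
	}
	for(int i = 0; i < started; i++){
		pthread_join(tid[i], NULL);
		if(job[i].err)
			err = 1;
	}
	free(tid);
	free(job);
	return err ? -1 : 0;
}
```

Calling parallel_copy(src, dst, size, 4) on two open file descriptors splits the file into four ranges. Since every worker reads into and writes from memory that no other worker touches, the ranges cannot clobber one another.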