i was tracking down a locking issue that was showing up
as alarm() returning too late by up to 1/2 a second.  this is
not an issue in stock plan 9, because alarm processing
happens on all processors.  this gets around the locking if
you have more than 1 cpu, at the expense of hammering
the cachelines related to alarm.  i'd be interested in
timings done on 24-core amd machines.

anyway, the problem is due to the surprisingly slow cga
console.

these timings are based on rdtsc() subtractions around the
named areas for the simple test of cat'ing /lib/pci to the
console.  they are huge:

area                    cycles
printing chars       482510340
scrolling         135746656968
total             137112900000
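
for anyone wanting to repeat the measurement, here is a minimal sketch
of the "subtract the counter around a named area" approach in portable c.
it is not the kernel code used for the numbers above.  cycles(), Area and
areaadd() are hypothetical names; __rdtsc() is the gcc/clang intrinsic,
and the clock() fallback is only there so the sketch builds on non-x86.

```c
#include <stdint.h>
#include <time.h>
#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
#endif

/*
 * hypothetical helper: read an increasing counter.
 * on x86 this is the time stamp counter; elsewhere we
 * fall back to clock(), which counts cpu time, not cycles.
 */
uint64_t
cycles(void)
{
#if defined(__x86_64__) || defined(__i386__)
	return __rdtsc();
#else
	return (uint64_t)clock();
#endif
}

/* accumulated cost of one named area (hypothetical type) */
typedef struct Area Area;
struct Area {
	const char *name;
	uint64_t total;	/* sum of end-start over all intervals */
	uint64_t n;	/* number of intervals recorded */
};

/* add one measured interval to an area */
void
areaadd(Area *a, uint64_t start, uint64_t end)
{
	a->total += end - start;
	a->n++;
}
```

usage is to bracket the code of interest: t = cycles(); do the work;
areaadd(&scrolling, t, cycles()); and print the totals at the end.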

by introducing a frame buffer to avoid reading from the
cga console for scrolling (a guess based on problems with
graphics performance), we get about a 10x improvement:

printing normal chars   1080381568
scrolling              12046340760
total                  13610262120
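
a minimal sketch of the shadow-buffer idea, assuming the usual 80x25
text mode with two bytes (char + attribute) per cell.  cgamem here is
an ordinary array standing in for the mapped video memory, and the
names (shadow, scrollup, putcell, Attr) are illustrative, not taken
from the actual change.  the point is that every read hits the copy in
ram and video memory is only ever written.

```c
#include <string.h>

enum {
	Width	= 80*2,	/* bytes per row: char + attribute per cell */
	Height	= 25,
	Attr	= 0x07,	/* assumption: grey on black */
};

/* stand-in for video memory; a kernel would use the mapped cga memory */
unsigned char cgamem[Width*Height];
/* shadow copy in ordinary ram; all reads go here */
unsigned char shadow[Width*Height];

/* scroll up one row, reading only from the shadow copy */
void
scrollup(void)
{
	int i;
	unsigned char *last;

	memmove(shadow, shadow+Width, Width*(Height-1));
	last = shadow + Width*(Height-1);
	for(i = 0; i < Width; i += 2){
		last[i] = ' ';
		last[i+1] = Attr;
	}
	/* video memory is written, never read */
	memmove(cgamem, shadow, sizeof shadow);
}

/* put one char in a cell, updating both copies */
void
putcell(int row, int col, int c)
{
	int o;

	o = row*Width + col*2;
	shadow[o] = cgamem[o] = c;
	shadow[o+1] = cgamem[o+1] = Attr;
}
```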

by guessing that any string >40 bytes is likely to induce scrolling,
we can write it to the frame buffer only and redraw the whole screen
once we're done.  compared with the frame buffer version, this gives us
roughly 30x on printing and 500x on scrolling, but just 7x in run time.

printing chars		33186956
scrolling			24111480
total			1854594800
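
the >40 byte heuristic can be sketched like this, again with an array
standing in for video memory.  putstr, putc1, redraw, Bigstr and the
nredraw counter are hypothetical names for illustration; only the
threshold of 40 comes from the text.  short strings are mirrored to
the screen as they go, long ones touch only the shadow and cost one
full-screen copy at the end, no matter how many rows they scroll.

```c
#include <string.h>

enum {
	Width	= 80*2,	/* bytes per row: char + attribute per cell */
	Height	= 25,
	Attr	= 0x07,	/* assumption: grey on black */
	Bigstr	= 40,	/* longer strings are assumed to scroll */
};

unsigned char cgamem[Width*Height];	/* stand-in for video memory */
unsigned char shadow[Width*Height];	/* copy in ordinary ram */
int pos;	/* cursor, as a byte offset into the screen */
int nredraw;	/* whole-screen copies, counted for illustration */

/* copy the whole shadow to video memory */
void
redraw(void)
{
	memmove(cgamem, shadow, sizeof shadow);
	nredraw++;
}

/* write one char to the shadow; if direct, mirror it to video memory */
void
putc1(int c, int direct)
{
	int i;

	if(c == '\n')
		pos += Width - pos%Width;
	else{
		shadow[pos] = c;
		shadow[pos+1] = Attr;
		if(direct){
			cgamem[pos] = c;
			cgamem[pos+1] = Attr;
		}
		pos += 2;
	}
	if(pos >= Width*Height){
		/* scroll within the shadow and blank the last row */
		memmove(shadow, shadow+Width, Width*(Height-1));
		for(i = Width*(Height-1); i < Width*Height; i += 2){
			shadow[i] = ' ';
			shadow[i+1] = Attr;
		}
		pos -= Width;
		if(direct)
			redraw();
	}
}

/* long strings skip per-char screen updates and redraw once at the end */
void
putstr(const char *s, int n)
{
	int i, direct;

	direct = n <= Bigstr;
	for(i = 0; i < n; i++)
		putc1(s[i], direct);
	if(!direct)
		redraw();
}
```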

this looks like about all we can do.

by the way, is there a reason to not use the cycle counter on
architectures that support it, or is there a reason to still maintain
sys->ticks / MACHP(0)->ticks by using the clock interrupt in
the portable code?

the alarm test program is attached.  the results should be interesting.
it does assume that HZ=1000.

- erik