From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Sun, 12 Nov 1995 07:28:13 -0500
From: John Carmack johnc@idcanon1.idsoftware.com
Subject: Graphics issues
Topicbox-Message-UUID: 33d3aab4-eac8-11e9-9e20-41e7f4b1d025
Message-ID: <19951112122813.sGX6fREL8Fs27cjj_3qPuXjEFKRhUHfyYfvIa4o58VY@z>


First, some investigative results:

Plan 9 provides no direct means of putting a dynamically
generated bitmap onto a rectangle of the screen.  The model is to
upload raw data to an offscreen bitmap, then bitblt from there to
your destination position on screen.

It is possible to write bitmap data directly to your virtualized
screen bitmap, theoretically avoiding the bitblt phase, but it
turns out that the data path from app - 8.5 - kernel for this
operation is not particularly optimized.  For small blits
(320*200) there is a small savings because there are fewer context
switches, but for large (640*480) blits it gets slower than doing
the two steps separately.  A side effect of doing this (bug?) is
that the upload goes over the entire window layer, including the
8.5 border, which is usually outside your cliprect.

The timing results I got (in ms, on a 25 MHz 2-bit NeXTstation,
640*480 2-bit blits) are:

94	direct write to virtualized screen bitmap
91	write to offscreen bitmap, then bitblt to screen
75	time for the write
16	time for the bitblt

Running the test program without 8.5 dramatically helped the
numbers.  Side note:  What is the proper way to exit 8.5?  Kill
it?  I added an exit menu option, because I am hacking around
inside it for some other things.

41	direct write to virtualized screen bitmap
33	write to offscreen bitmap, then bitblt to screen
22	time for the write
10	time for the bitblt

To see where this stands in absolute terms, I directly mapped the
framebuffer into memory and timed the copy myself.  To grab the
NeXT framebuffer, you need to:
Add in 9/next/segment.h:
/*JDC*/	{ SG_PHYSICAL,	"fb",		DISPLAYRAM, 262144,	0,	0 },
Add in your program:
	fb = (uchar *)segattach (0, "fb", 0, 262144);

After I got the vram mapped in, a quick looping copy returned 10
ms for the copy.  Unwinding only improved it to 9 ms, so the good
news is that bitblt operates at basically full memory speed.
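
For reference, the two copy loops being compared look roughly like the sketch below.  This is not the original test code: copywords and copywords4 are hypothetical names, and the destination here is an ordinary buffer standing in for the segattach'ed framebuffer.  A 640*480 2-bit frame is 76800 bytes, or 19200 32-bit words.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 640*480 pixels at 2 bits per pixel = 76800 bytes = 19200 32-bit words. */
enum { FRAMEWORDS = 640 * 480 * 2 / 8 / 4 };

/* Simple looping copy, one 32-bit word per iteration. */
void
copywords(uint32_t *dst, const uint32_t *src, size_t n)
{
	size_t i;

	for(i = 0; i < n; i++)
		dst[i] = src[i];
}

/* The same copy unrolled four times; n must be a multiple of 4. */
void
copywords4(uint32_t *dst, const uint32_t *src, size_t n)
{
	size_t i;

	assert(n % 4 == 0);
	for(i = 0; i < n; i += 4){
		dst[i] = src[i];
		dst[i+1] = src[i+1];
		dst[i+2] = src[i+2];
		dst[i+3] = src[i+3];
	}
}
```

To time it, call either function with FRAMEWORDS words between two clock reads; the unrolled version only saves loop overhead, which fits with the small 10 ms to 9 ms gain if the copy is bound by memory speed.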

Some comments:

Timing numbers are very consistent under Plan 9, a welcome change
from most Unix systems.

The virtualized screen/bitblt/etc devices are a damn fine thing
(debugging a window system inside a window is just plain First
Order Cool), but they definitely do get in the way of extracting
good performance from multimedia / game type applications.

The generated blit code works very well.  My first reaction to the
dynamically compiled libgnot code was "this is no longer
appropriate on modern highly cached architectures" (I'm not
positive about that, but my experience leads me to believe it),
but for an old 68040, it seems to be pretty spot on.  I was
impressed at how well it handled misaligned and varying bit depth
blits.


Some potential suggestions:

The wrbitmap() call and the underlying bitblt protocol could be
extended to accept a full rectangle for its destination.  It
might be necessary to 32 bit align the transferred rows, but I see
little point in allowing row specifications and not column.  That
would be the conceptual architecture for the action of "put these
pixels there", without the baggage of the offscreen bitmap that
never gets referenced with the same data twice.  With some
attention paid to the efficiency of the data path through 8.5 and
the devbit device, I'm sure it could be 2x the speed of the
current operation.  Allowing arbitrary pixel alignment would
complicate the write a lot, because it would basically become a
bitblt instead of a copy.
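
A minimal sketch of what the receiving end of such a rectangle write could look like, assuming word-aligned columns so that each row is a plain copy.  Wordmap and wrrect are hypothetical names made up for illustration, not part of libg or the devbit protocol.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified bitmap: rows of 32-bit words. */
typedef struct Wordmap Wordmap;
struct Wordmap {
	uint32_t *base;	/* first word of the bitmap */
	int width;	/* words per row */
	int height;	/* rows */
};

/*
 * Copy a block of rows into the word-aligned rectangle
 * [x0, x1) x [y0, y1) of dst, with x measured in 32-bit words.
 * src holds (y1-y0) rows of (x1-x0) words, tightly packed.
 */
void
wrrect(Wordmap *dst, int x0, int y0, int x1, int y1, const uint32_t *src)
{
	int y, n;

	assert(0 <= x0 && x0 <= x1 && x1 <= dst->width);
	assert(0 <= y0 && y0 <= y1 && y1 <= dst->height);
	n = x1 - x0;
	for(y = y0; y < y1; y++){
		/* aligned rows mean a straight copy, no shifting or masking */
		memcpy(dst->base + y*dst->width + x0, src, n * sizeof(uint32_t));
		src += n;
	}
}
```

With arbitrary pixel alignment the memcpy would have to become a shift-and-mask loop per row, which is the bitblt-like complication mentioned above.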


If bitmap memory could be shared between user programs and the
devbit driver, the only action required would be the bitblt,
which is 4x to 6x faster than the current operation.  The current
fixed limit of bitmaps allocated inside the kernel is a fairly big
problem by itself, so it sounds reasonable to kill two birds with
one stone by creating a new bitmap memory segment for each
process, and let it be shared by the kernel and the user process.
Wrapper functions could be created to allow automatic
virtualization over a network.


For some operations, there is just no substitute for having the
framebuffer memory mapped in.  Yes, you have to deal with all
format conversions yourself, but you can often combine a final
operation on your data (like dithering / color space conversion)
with the transfer to screen, saving at least two main memory
operations per pixel.  An extreme example is the magnification of
a rendered scene, where you want to do:

read some pixels
write them four times or nine times to the screen

Instead of:

read some pixels
write them multiple times into a large memory buffer
upload the magnified buffer to a bitmap
bitblt the magnified buffer to the screen.

Even without a multiplexer in the way of the last two steps, there
is over a factor of 20x difference there.
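
The direct version of the magnification can be sketched as below.  To keep it short this assumes one byte per pixel; on the 2-bit NeXT display the inner loop would also have to pack pixels into bytes.  magnify2 is a hypothetical name, and dst would be the mapped framebuffer with dststride as its row width.

```c
#include <assert.h>
#include <stdint.h>

/*
 * 2x magnification straight into the destination: each source
 * pixel is read once and written four times (a 2x2 block).
 * src is w*h pixels, tightly packed; dst rows are dststride
 * bytes apart.  Assumes one byte per pixel.
 */
void
magnify2(uint8_t *dst, int dststride, const uint8_t *src, int w, int h)
{
	int x, y;
	uint8_t p, *d0, *d1;

	assert(dststride >= 2*w);
	for(y = 0; y < h; y++){
		d0 = dst + (2*y)*dststride;	/* upper output row */
		d1 = d0 + dststride;		/* lower output row */
		for(x = 0; x < w; x++){
			p = src[y*w + x];
			d0[2*x] = d0[2*x+1] = p;
			d1[2*x] = d1[2*x+1] = p;
		}
	}
}
```

No intermediate magnified buffer is ever built, so the extra write, upload and blit of that buffer disappear entirely.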

Framebuffer access can be virtualized even over a network with a
scheme like:

WriteFramebuffer (Rectangle r, uchar **start, uint *rowwidth);
<do stuff>
FinishFramebuffer ();

If the display is local and the rectangle is completely exposed,
the start / rowwidth returned values are actually the framebuffer.
Otherwise, they are just a memory buffer that will be transferred
to screen in a more conventional manner upon the call to
FinishFramebuffer().
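
A toy sketch of how the two cases could sit behind that interface.  Everything here beyond the two function names is an assumption for illustration: the screen is a small byte-per-pixel array standing in for video memory, Rectangle is flattened compared to the libg one, the fbdirect flag stands in for the "local and fully exposed" test, and plain unsigned char / unsigned int are used where Plan 9 code would say uchar / uint.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct Rectangle Rectangle;
struct Rectangle { int minx, miny, maxx, maxy; };	/* simplified */

enum { SCRW = 16, SCRH = 8 };	/* toy screen, one byte per pixel */

unsigned char screen[SCRW*SCRH];	/* stands in for mapped video memory */
int fbdirect = 1;	/* assumed result of "local and fully exposed" */

static unsigned char *staging;	/* non-NULL while a buffered write is open */
static Rectangle pending;

void
WriteFramebuffer(Rectangle r, unsigned char **start, unsigned int *rowwidth)
{
	int w, h;

	assert(0 <= r.minx && r.minx < r.maxx && r.maxx <= SCRW);
	assert(0 <= r.miny && r.miny < r.maxy && r.maxy <= SCRH);
	assert(staging == NULL);
	pending = r;
	if(fbdirect){
		/* hand out the real thing */
		*start = screen + r.miny*SCRW + r.minx;
		*rowwidth = SCRW;
		return;
	}
	/* otherwise hand out a scratch buffer just big enough for r */
	w = r.maxx - r.minx;
	h = r.maxy - r.miny;
	staging = calloc((size_t)w * (size_t)h, 1);
	assert(staging != NULL);
	*start = staging;
	*rowwidth = (unsigned int)w;
}

void
FinishFramebuffer(void)
{
	int y, w;

	if(staging == NULL)
		return;	/* direct case: the pixels are already on screen */
	w = pending.maxx - pending.minx;
	for(y = pending.miny; y < pending.maxy; y++)
		memcpy(screen + y*SCRW + pending.minx,
			staging + (y - pending.miny)*w, (size_t)w);
	free(staging);
	staging = NULL;
}
```

The caller writes pixels at start[y*rowwidth + x] in both cases and never needs to know which one it got.  In a real system the copy in FinishFramebuffer would be the usual upload and bitblt, possibly across the network.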


Events:

Plan 9 should have a /dev/events device that combines the mouse,
cons, and time devices.  There would be sizable efficiency,
development ease, and user interface benefits from this.

Currently, 8.5 can miss mouse clicks, and they aren't properly
interleaved with keyboard input.  To do a game, I need to get key
up events, and that totally doesn't fit the cons setup.

Instead of doing:

while (read a key)
	deal with it
read the mouse
	deal with it
read the time


You could do:

read all pending events into a buffer.
while (event)
	if we care about it
		deal with it
	record current time

A control file could allow you to mask events you don't care for,
like mouse movements or key up events.  You should be able to
enable a time event that is always returned last, which would
mean the read would never block, which is what you want on a sim
anyway, and it gets rid of the need to fork processes just to
watch blocking files.
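
A small sketch of the read semantics described above, with the kernel side simulated by an array of pending events.  The Event layout, the E* mask bits and readevents are all hypothetical; no such device or format exists in Plan 9 today.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event types, usable as mask bits. */
enum {
	Ekeydown = 1<<0,
	Ekeyup   = 1<<1,
	Emouse   = 1<<2,
	Etime    = 1<<3,
};

typedef struct Event Event;
struct Event {
	int type;	/* one of the E* bits */
	int value;	/* key code, button state, or milliseconds */
};

/*
 * Copy pending events whose type is in mask to out, in order,
 * storing at most max entries.  If Etime is in mask, a time event
 * is always appended last, so the result is never empty and a
 * real read would never need to block.
 * Returns the number of events stored.
 */
size_t
readevents(const Event *pending, size_t npending, int mask, int now,
	Event *out, size_t max)
{
	size_t i, n, room;

	assert(max > 0);
	/* reserve the final slot for the time event if it is enabled */
	room = (mask & Etime) ? max - 1 : max;
	n = 0;
	for(i = 0; i < npending && n < room; i++)
		if(pending[i].type & mask)
			out[n++] = pending[i];
	if(mask & Etime){
		out[n].type = Etime;
		out[n].value = now;
		n++;
	}
	return n;
}
```

A game loop would call this once per frame with Ekeydown|Ekeyup|Emouse|Etime, walk the returned events in order, and take the trailing time event as the timestamp for the frame.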

I see this as a no-drawbacks, just-plain-right thing to do.


A side issue:

Occasionally I see a black flash when scrolling text.  It looks
like what should happen when the devbit device is bit inverting as
it goes to the screen, but the NeXTstations should have the
correct native pixel format.  Any ideas?



John Carmack
Id Software