From mboxrd@z Thu Jan  1 00:00:00 1970
From: Mike Haertel <mike@ducky.net>
Message-Id: <200112110801.fBB81J357621@ducky.net>
To: 9fans@cse.psu.edu
Subject: Re: [9fans] bochs still no go
In-Reply-To: <20011211032545.43452199B5@mail.cse.psu.edu>
Date: Tue, 11 Dec 2001 00:01:19 -0800
Topicbox-Message-UUID: 3280aa70-eaca-11e9-9e20-41e7f4b1d025

>> If "RDMSR" is being used to read the time stamp counter,
>> it should be replaced with RDTSC (0x0F 0x31).  RDMSR is
>> a much slower instruction.
>
>That's not at all clear.  I bet they're approximately
>the same on real hardware.  RDMSR is much slower under
>VMware because it requires trapping into the VMware
>runtime, while RDTSC, an unprivileged instruction, does not.

Ok, I'll admit to a bit of an unfair advantage on this issue: I
can't speak for AMD processors, but I used to work at Intel, as an
architect on the team that did the Pentium Pro and Pentium 4
processors.  I've seen the microcode, and I can assure you that on
Intel processors RDMSR is indeed substantially slower.

The reason is that many of the so-called "machine-specific registers"
that you can read by RDMSR don't really exist as registers in
the hardware at all; instead they are just magic numbers specifying
particular values that the processor microcode can put together
for you by poking around at bits and pieces of internal state
that are often widely distributed throughout the hardware.

So the processor's microcode for the RDMSR instruction is roughly
equivalent to the following C fragment:

	RDMSR:
		if (not in kernel mode)
			fault;
		switch (ecx) {
		...
		case 0x10:
			copy the time stamp counter to (eax:edx);
			break;
		...
		}

whereas the microcode for RDTSC is just:

	RDTSC:
		copy the time stamp counter to (eax:edx);

On Intel processors, an indirect jump in the microcode (the switch)
is guaranteed to be mispredicted, since the usual branch prediction
mechanisms for macroinstruction branches do not apply to microcode
branches (and especially not microcode indirect jumps), so RDMSR
causes the pipeline to get flushed at least one extra time.
In addition RDMSR is specified to be a "serializing instruction",
which means that the pipeline is drained of older instructions
before the first microinstruction of RDMSR even starts executing.

On x86 processors with RDTSC, you can get pretty high precision
timing for even very fast operations with the following approach:

	x = rdtsc();
	y = rdtsc();
	thing_you_want_to_measure();
	z = rdtsc();
	cycles = (z - y) - (y - x);

(The idea is the "y - x" subtracts out the time required by RDTSC itself.)
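
A minimal C sketch of the approach above, assuming GCC or Clang, whose
<x86intrin.h> header provides the __rdtsc() intrinsic.  The names
read_tsc, corrected_cycles and measure are hypothetical, chosen for
illustration; the non-x86 branch is only a stand-in so that the file
compiles elsewhere, and it does not count cycles.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>	/* GCC/Clang: declares __rdtsc() */
/* Read the time stamp counter with the RDTSC instruction. */
uint64_t read_tsc(void) { return __rdtsc(); }
#else
#include <time.h>
/* Assumption: stand-in for non-x86 builds; clock() is not a cycle counter. */
uint64_t read_tsc(void) { return (uint64_t)clock(); }
#endif

/*
 * The arithmetic from the message: (z - y) covers one counter read plus
 * the measured code, and (y - x) covers one counter read by itself.
 * The result is signed because noise can make it slightly negative.
 */
int64_t corrected_cycles(uint64_t x, uint64_t y, uint64_t z)
{
	return (int64_t)(z - y) - (int64_t)(y - x);
}

/* Time a single call of fn and subtract the cost of reading the counter. */
int64_t measure(void (*fn)(void))
{
	uint64_t x, y, z;

	assert(fn);	/* fn must not be a null pointer */
	x = read_tsc();
	y = read_tsc();
	fn();
	z = read_tsc();
	return corrected_cycles(x, y, z);
}
```

Calling measure(thing_you_want_to_measure) returns the corrected count.
A single reading can be noisy, so it is probably worth repeating the
measurement a few times before trusting the number.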

Using this method on a Pentium III, I measured RDMSR with ecx == 0x10
to require ~90 cycles, and RDTSC to require "only" ~30 cycles.  The timing
will be similar or identical on the rest of the P6 family (Pentium Pro,
Pentium II, Celeron).

I don't have a Pentium 4 handy to try this on, but I expect the
performance difference between RDMSR and RDTSC would be even more
pronounced, due to the deeper pipeline among other things.