From mboxrd@z Thu Jan 1 00:00:00 1970
From: Mike Haertel
Message-Id: <200112110801.fBB81J357621@ducky.net>
To: 9fans@cse.psu.edu
Subject: Re: [9fans] bochs still no go
In-Reply-To: <20011211032545.43452199B5@mail.cse.psu.edu>
Date: Tue, 11 Dec 2001 00:01:19 -0800
Topicbox-Message-UUID: 3280aa70-eaca-11e9-9e20-41e7f4b1d025

>> If "RDMSR" is being used to read the time stamp counter,
>> it should be replaced with RDTSC (0x0F 0x31).  RDMSR is
>> a much slower instruction.
>
>That's not at all clear.  I bet they're approximately
>the same on real hardware.  RDMSR is much slower under
>VMware because it requires trapping into the VMware
>runtime, while RDTSC, an unprivileged instruction, does not.

Ok, I'll admit to a bit of an unfair advantage on this issue:
I can't speak for AMD processors, but I used to work at Intel,
as an architect on the team that did the Pentium Pro and Pentium 4
processors.  I've seen the microcode, and I can assure you that
on Intel processors RDMSR is indeed substantially slower.

The reason is that many of the so-called "machine-specific registers"
that you can read by RDMSR don't really exist as registers in the
hardware at all; instead they are just magic numbers specifying
particular values that the processor microcode can put together
for you by poking around at bits and pieces of internal state that
are often widely distributed throughout the hardware.

So the processor's microcode for the RDMSR instruction is roughly
equivalent to the following C fragment:

	RDMSR:
		if (not in kernel mode)
			fault;
		switch (ecx) {
		...
		case 0x10:
			copy the time stamp counter to (eax:edx);
			break;
		...
		}

whereas the microcode for RDTSC is just:

	RDTSC:
		copy the time stamp counter to (eax:edx);

On Intel processors, an indirect jump in the microcode (the switch)
is guaranteed to be mispredicted, since the usual branch prediction
mechanisms for macroinstruction branches do not apply to microcode
branches (and especially not microcode indirect jumps), so RDMSR
causes the pipeline to be flushed at least one extra time.

In addition, RDMSR is specified to be a "serializing instruction",
which means that the pipeline is drained of older instructions
before the first microinstruction of RDMSR even starts executing.

On x86 processors with RDTSC, you can get pretty high precision
timing for even very fast operations with the following approach:

	x = rdtsc();
	y = rdtsc();
	thing_you_want_to_measure();
	z = rdtsc();
	cycles = (z - y) - (y - x);

(The idea is that "y - x" subtracts out the time required by
RDTSC itself.)

Using this method on a Pentium III, I measured RDMSR with ecx == 0x10
to require ~90 cycles, and RDTSC to require "only" ~30 cycles.  The
timing will be similar or identical on the rest of the P6 family
(Pentium Pro, Pentium II, Celeron).

I don't have a Pentium 4 handy to try this on, but I expect the
performance difference between RDMSR and RDTSC would be even more
pronounced, due to the deeper pipeline among other things.
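
For anyone who wants to try the timing recipe above, here is a
minimal C sketch of it, assuming GCC or Clang on x86, where
__rdtsc() from <x86intrin.h> issues RDTSC.  The helper names
(read_counter, measure_once, measure_min, sample_work) are
hypothetical and not part of the original recipe.  Taking the
minimum over repeated runs is an added assumption to damp noise
from interrupts and cache misses, and the clock_gettime fallback
is there only so the sketch builds on non-x86 machines, where it
counts nanoseconds rather than cycles.

```c
#define _POSIX_C_SOURCE 199309L         /* for clock_gettime in the fallback */
#include <stdint.h>
#include <time.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>                  /* __rdtsc() with GCC and Clang */
static uint64_t read_counter(void)
{
	return __rdtsc();               /* executes RDTSC, returns the 64-bit count */
}
#else
/* Assumption: fallback for non-x86 builds; counts nanoseconds, not cycles. */
static uint64_t read_counter(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}
#endif

/* One sample of cycles = (z - y) - (y - x); "y - x" is the counter overhead. */
static int64_t measure_once(void (*fn)(void))
{
	uint64_t x = read_counter();
	uint64_t y = read_counter();
	fn();
	uint64_t z = read_counter();

	/* Signed result: for very short operations this can dip below zero. */
	return (int64_t)(z - y) - (int64_t)(y - x);
}

/* Assumption (not in the recipe): keep the smallest sample to reduce noise. */
static int64_t measure_min(void (*fn)(void), int runs)
{
	int64_t best = INT64_MAX;

	for (int i = 0; i < runs; i++) {
		int64_t c = measure_once(fn);
		if (c < best)
			best = c;
	}
	return best;
}

/* Hypothetical workload; volatile keeps the compiler from deleting the loop. */
static volatile uint32_t sink;

static void sample_work(void)
{
	sink = 0;
	for (uint32_t i = 0; i < 1000; i++)
		sink += i;
}
```

Calling measure_min(sample_work, 100) returns the smallest corrected
count seen over 100 runs.  The numbers vary from machine to machine,
and on current processors the time stamp counter generally ticks at
a fixed reference rate, so treat the result as counter ticks rather
than exact core cycles.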