Chris King wrote:
On 9/6/07, Tom <tom.primozic@gmail.com> wrote:
  
However, would it be possible to "emulate" cpu registers using software? By
keeping registers in the main memory, but accessing them often enough to
keep them in primary cache? That would be quite fast I believe...
    

This makes me wonder... why have registers to begin with?  I wonder
how feasible a chip with a, say, 256-byte "register-level" cache would
be.
  
Such chips exist.  The Itanium is one example.

The problem is gate delays.  The purpose of registers is to be faster than L1 cache (which typically has a 2-3 clock delay associated with it).  But the more registers you have, the more gate delays you need to read or write one: the naive implementation takes O(log N) gate delays to select among N registers.  Reality is more complicated than this, but the rule that more registers = more gate delays holds true.  And these gate delays translate into a slower chip one way or another: either you have to lower your clock rate, or add more pipeline stages, or both, to deal with the larger register file.

Of course, more registers make compilers happy and lower the pressure on cache bandwidth (as the compiler doesn't need to spill/refill registers quite so often).  This is why the 64-bit x86 is generally faster than the 32-bit x86: going from 8 (6 in practice) to 16 (14 in practice) registers was a big step up.  The Itanium has a large enough register set that its performance is probably getting hurt by it, but it's hard to tell with everything else going on.
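
To put rough numbers on the O(log N) figure, here is a minimal sketch that assumes the naive design is a balanced tree of 2:1 multiplexers selecting one register out of N.  The function name mux_tree_depth is made up for illustration, and the result counts mux levels only; it says nothing about wire delay, port count, or how any real chip is built.

```python
def mux_tree_depth(num_registers: int) -> int:
    """Levels of 2:1 muxes in a balanced tree that selects 1 of N registers.

    Assumption (not a real design): each level halves the candidates,
    so the depth is ceil(log2(N)).
    """
    if num_registers < 1:
        raise ValueError("need at least one register")
    # (N - 1).bit_length() equals ceil(log2(N)) for N >= 1, with no float rounding.
    return (num_registers - 1).bit_length()


if __name__ == "__main__":
    # 8 and 16 are the x86 register counts mentioned above; 64 and 128 are
    # simply larger powers of two for comparison.
    for n in (8, 16, 64, 128):
        print(f"{n:4d} registers -> {mux_tree_depth(n)} mux levels")
```

Under that assumption, doubling the register count adds one mux level to every register read: 8 registers need 3 levels, 16 need 4, 64 need 6, and 128 need 7.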

The sweet spot for register sets seems to be in the 16-64 range: with fewer than that you're being hurt by the increased memory pressure, and with more than that you're probably being hurt by the slower register addressing.

Brian