Traditionally, musl has gone to pretty great lengths to avoid
depending on the thread pointer. The original reason was that it was
not always initialized, and when it was, the init was lazy. This
resulted in a lot of cruft, where we would have lots of constructs of
the form:

	bar = some_predicate ? __pthread_self()->foo : global_foo

or similar. Being that these predicates depend(ed) on globals, they
were/are rather expensive in position-independent code on most archs.
Now that the thread pointer is always initialized at startup (since
1.1.0) and assumed to have succeeded (since 1.1.9; musl now performs
HCF if it fails), this seems to be an unnecessary cost. Not only does
it cost cycles; it also has a complexity cost in terms of code to
maintain the state of the predicates (e.g. the atomics for locale
state) and in terms of libc-internal assumptions. So I'd like to just
use the thread pointer directly wherever it makes sense, and take
advantage of the fact that we have it.

Unfortunately, there's one arch where thread-pointer access may be
prohibitively costly: old MIPS. On the MIPS o32 ABI, the thread
pointer is accessed via the "rdhwr $3,$29" instruction, which was only
introduced in MIPS32rev2. MIPS-I, MIPS-II, and possibly the original
MIPS32 lack it, and while Linux has a "fast path" trap to emulate it,
I'm not clear on how "fast" it is.

First, I'd like to find out how slow this trap is. If it's something
like 150 cycles, that's ugly but probably acceptable. If it's more
like 1000 cycles, that's a big problem. If anyone can run the attached
test program on real MIPS-I or MIPS-II hardware and give me the
results, please do! Compile it once with -O3 -DDO_RDHWR and once with
just -O3 and send the (one-line) output of both to the list. It
doesn't matter what libc your MIPS system is using -- any should be
fine, but you might need to link with -lrt on glibc or uclibc.

Now, depending on the results, we have 2 options:

1. If rdhwr emulation on old MIPS is not horribly slow, just do the
   unconditional thread-pointer usage with no MIPS-specific changes.

2. If introducing rdhwr all over the place on old MIPS would be a
   serious performance regression, we take advantage of the fact that
   we're not using compiler-generate TLS access (which would emit
   rdhwr instructions) in musl. We control the definition of
   __pthread_self(), which musl uses internally to get the thread
   pointer (adjusted to point to the pthread structure), so when
   compiling code that might run on old MIPS (according to -march
   settings and the resulting predefined macros), we can define
   __pthread_self() to an expression or function that first checks a
   global to see if process is multi-threaded, and if not, just reads
   the thread pointer from a global instead of using rdhwr. Basically,
   this would be keeping the same way we're doing things now, but
   tucking it away as an old-MIPS-specific hack and encapsulating it
   in __pthread_self() rather than having it in every caller.

So I think, whatever the performance results end up being, we have an
acceptable path forward to use the (possibly virtual) thread pointer
unconditionally throughout musl.

Rich