Traditionally, musl has gone to pretty great lengths to avoid depending on the thread pointer. The original reason was that it was not always initialized, and when it was, the initialization was lazy. This resulted in a lot of cruft: lots of constructs of the form

    bar = some_predicate ? __pthread_self()->foo : global_foo;

or similar. Since these predicates depend(ed) on globals, they were/are rather expensive in position-independent code on most archs.

Now that the thread pointer is always initialized at startup (since 1.1.0) and its initialization is assumed to succeed (since 1.1.9; musl now performs HCF if it fails), this seems to be an unnecessary cost. Not only does it cost cycles; it also has a complexity cost, both in the code that maintains the state of the predicates (e.g. the atomics for locale state) and in libc-internal assumptions. So I'd like to just use the thread pointer directly wherever it makes sense, and take advantage of the fact that we have it.

Unfortunately, there's one arch where thread-pointer access may be prohibitively costly: old MIPS. On the MIPS o32 ABI, the thread pointer is accessed via the "rdhwr $3,$29" instruction, which was only introduced in MIPS32rev2. MIPS-I, MIPS-II, and possibly the original MIPS32 lack it, and while Linux has a "fast path" trap to emulate it, I'm not clear on how "fast" it is.

First, I'd like to find out how slow this trap is. If it's something like 150 cycles, that's ugly but probably acceptable. If it's more like 1000 cycles, that's a big problem. If anyone can run the attached test program on real MIPS-I or MIPS-II hardware and send me the results, please do! Compile it once with -O3 -DDO_RDHWR and once with just -O3, and send the (one-line) output of both to the list. It doesn't matter what libc your MIPS system is using; any should be fine, but you might need to link with -lrt on glibc or uClibc.

Now, depending on the results, we have 2 options:

1. If rdhwr emulation on old MIPS is not horribly slow, just use the thread pointer unconditionally, with no MIPS-specific changes.

2. If introducing rdhwr all over the place on old MIPS would be a serious performance regression, we take advantage of the fact that musl does not use compiler-generated TLS access (which would emit rdhwr instructions). We control the definition of __pthread_self(), which musl uses internally to get the thread pointer (adjusted to point to the pthread structure). So when compiling code that might run on old MIPS (according to -march settings and the resulting predefined macros), we can define __pthread_self() to an expression or function that first checks a global to see whether the process is multi-threaded, and if not, just reads the thread pointer from a global instead of using rdhwr. Basically, this would keep the same way we're doing things now, but tuck it away as an old-MIPS-specific hack encapsulated in __pthread_self() rather than spread across every caller.

So I think, whatever the performance results end up being, we have an acceptable path forward to use the (possibly virtual) thread pointer unconditionally throughout musl.

Rich
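As a hypothetical illustration of the cruft described at the start of this message, the C sketch below contrasts the predicate-guarded pattern (bar = some_predicate ? __pthread_self()->foo : global_foo) with direct thread-pointer access, using locale state as the example. All names here (struct pthread_sketch, locale_predicate, self_sketch(), etc.) are invented for illustration and are not musl's actual internals; a _Thread_local object stands in for the real thread-pointer-relative pthread structure.

```c
/* Hypothetical, much-simplified stand-in for musl's internal pthread structure. */
struct pthread_sketch {
	const char *locale; /* per-thread state, e.g. the current locale */
};

/* Hypothetical globals used only by the old, predicate-guarded pattern.
 * (Plain int for brevity; real code of this kind needed atomics.) */
static int locale_predicate;            /* nonzero once any thread set its own locale */
static const char *global_locale = "C"; /* process-wide default */

/* Stand-in for the thread-pointer-relative descriptor. Every thread's copy
 * starts out holding the default, which is the invariant that lets the new
 * pattern drop the predicate entirely. */
static _Thread_local struct pthread_sketch self_storage = { "C" };

static struct pthread_sketch *self_sketch(void)
{
	return &self_storage; /* in musl, this role is played by __pthread_self() */
}

/* Old pattern: a global predicate guards every thread-pointer access. */
static const char *current_locale_old(void)
{
	return locale_predicate ? self_sketch()->locale : global_locale;
}

/* New pattern: the thread pointer is always valid, so just use it. */
static const char *current_locale_new(void)
{
	return self_sketch()->locale;
}

/* Setting a per-thread locale; the old scheme also had to flip the predicate. */
static void set_thread_locale(const char *name)
{
	self_sketch()->locale = name;
	locale_predicate = 1; /* only needed by current_locale_old() */
}
```

The new version is only correct if every thread's descriptor is initialized with the default at startup or thread creation; in this sketch that bookkeeping replaces both the predicate and the global fallback.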
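Option 2 could look roughly like the following hypothetical sketch. It is not musl's code: struct pthread_sketch, threaded_sketch, main_self_sketch, read_tp_slow(), and pthread_self_sketch() are invented names, and a _Thread_local object stands in for the real thread pointer so the sketch builds on any host. On pre-MIPS32rev2 hardware, read_tp_slow() is where the trapped and emulated rdhwr would execute; the point is only that the single-threaded case is served from a global, so callers of __pthread_self() never see the difference.

```c
/* Hypothetical, much-simplified stand-in for musl's internal pthread structure. */
struct pthread_sketch {
	struct pthread_sketch *self;
};

/* Hypothetical globals: the main thread's descriptor, saved at startup, and a
 * flag set just before the first additional thread is created. Because the
 * flag is written before pthread_create(), which synchronizes memory, new
 * threads always observe it as set. */
static int threaded_sketch;
static struct pthread_sketch *main_self_sketch;

/* Stand-in for the real thread-pointer-relative descriptor. */
static _Thread_local struct pthread_sketch tp_storage;

/* "Slow" path: on old MIPS this is where rdhwr $3,$29 would be executed
 * (and trapped/emulated by the kernel). */
static struct pthread_sketch *read_tp_slow(void)
{
	return &tp_storage;
}

/* What __pthread_self() might expand to on old-MIPS builds only; other
 * archs would keep reading the thread pointer directly. */
static inline struct pthread_sketch *pthread_self_sketch(void)
{
	if (!threaded_sketch)
		return main_self_sketch; /* single-threaded: no rdhwr at all */
	return read_tp_slow();
}

/* Startup: runs on the main thread before anything calls pthread_self_sketch(). */
static void init_main_thread_sketch(void)
{
	struct pthread_sketch *self = read_tp_slow();
	self->self = self;
	main_self_sketch = self;
}

/* Called before creating the first additional thread. */
static void mark_threaded_sketch(void)
{
	threaded_sketch = 1;
}
```

On the main thread, both paths return the same descriptor; that property is what would let the hack stay hidden inside __pthread_self() instead of leaking into every caller.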