From mboxrd@z Thu Jan 1 00:00:00 1970
X-Msuck: nntp://news.gmane.org/gmane.linux.lib.musl.general/7643
Path: news.gmane.org!not-for-mail
From: Rich Felker
Newsgroups: gmane.linux.lib.musl.general
Subject: Eliminating preference for avoiding thread pointer? Cost on MIPS?
Date: Fri, 15 May 2015 23:55:44 -0400
Message-ID: <20150516035544.GA4274@brightrain.aerifal.cx>
Reply-To: musl@lists.openwall.com
NNTP-Posting-Host: plane.gmane.org
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="X1bOJ3K7DJ5YkBrT"
X-Trace: ger.gmane.org 1431748567 29445 80.91.229.3 (16 May 2015 03:56:07 GMT)
X-Complaints-To: usenet@ger.gmane.org
NNTP-Posting-Date: Sat, 16 May 2015 03:56:07 +0000 (UTC)
To: musl@lists.openwall.com
Original-X-From: musl-return-7655-gllmg-musl=m.gmane.org@lists.openwall.com Sat May 16 05:56:06 2015
Return-path:
Envelope-to: gllmg-musl@m.gmane.org
Original-Received: from mother.openwall.net ([195.42.179.200]) by plane.gmane.org with smtp (Exim 4.69) (envelope-from ) id 1YtTCo-0003Jr-Hk for gllmg-musl@m.gmane.org; Sat, 16 May 2015 05:56:06 +0200
Original-Received: (qmail 22263 invoked by uid 550); 16 May 2015 03:56:03 -0000
Mailing-List: contact musl-help@lists.openwall.com; run by ezmlm
Precedence: bulk
List-Post:
List-Help:
List-Unsubscribe:
List-Subscribe:
Original-Received: (qmail 22218 invoked from network); 16 May 2015 03:55:58 -0000
Content-Disposition: inline
User-Agent: Mutt/1.5.21 (2010-09-15)
Original-Sender: Rich Felker
Xref: news.gmane.org gmane.linux.lib.musl.general:7643
Archived-At:

--X1bOJ3K7DJ5YkBrT
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

Traditionally, musl has gone to pretty great lengths to avoid
depending on the thread pointer. The original reason was that it was
not always initialized, and when it was, the init was lazy. This
resulted in a lot of cruft, where we would have lots of constructs of
the form:

	bar = some_predicate ? __pthread_self()->foo : global_foo

or similar.
Being that these predicates depend(ed) on globals, they were/are
rather expensive in position-independent code on most archs.

Now that the thread pointer is always initialized at startup (since
1.1.0) and assumed to have succeeded (since 1.1.9; musl now performs
HCF if it fails), this seems to be an unnecessary cost. Not only does
it cost cycles; it also has a complexity cost in terms of code to
maintain the state of the predicates (e.g. the atomics for locale
state) and in terms of libc-internal assumptions. So I'd like to just
use the thread pointer directly wherever it makes sense, and take
advantage of the fact that we have it.

Unfortunately, there's one arch where thread-pointer access may be
prohibitively costly: old MIPS. On the MIPS o32 ABI, the thread
pointer is accessed via the "rdhwr $3,$29" instruction, which was only
introduced in MIPS32rev2. MIPS-I, MIPS-II, and possibly the original
MIPS32 lack it, and while Linux has a "fast path" trap to emulate it,
I'm not clear on how "fast" it is.

First, I'd like to find out how slow this trap is. If it's something
like 150 cycles, that's ugly but probably acceptable. If it's more
like 1000 cycles, that's a big problem. If anyone can run the attached
test program on real MIPS-I or MIPS-II hardware and give me the
results, please do! Compile it once with -O3 -DDO_RDHWR and once with
just -O3 and send the (one-line) output of both to the list. It
doesn't matter what libc your MIPS system is using -- any should be
fine, but you might need to link with -lrt on glibc or uclibc.

Now, depending on the results, we have 2 options:

1. If rdhwr emulation on old MIPS is not horribly slow, just do the
   unconditional thread-pointer usage with no MIPS-specific changes.

2. If introducing rdhwr all over the place on old MIPS would be a
   serious performance regression, we take advantage of the fact that
   we're not using compiler-generated TLS access (which would emit
   rdhwr instructions) in musl.
   We control the definition of __pthread_self(), which musl uses
   internally to get the thread pointer (adjusted to point to the
   pthread structure), so when compiling code that might run on old
   MIPS (according to -march settings and the resulting predefined
   macros), we can define __pthread_self() to an expression or
   function that first checks a global to see if the process is
   multi-threaded, and if not, just reads the thread pointer from a
   global instead of using rdhwr. Basically, this would be keeping
   the way we're doing things now, but tucking it away as an
   old-MIPS-specific hack and encapsulating it in __pthread_self()
   rather than having it in every caller.

So I think, whatever the performance results end up being, we have an
acceptable path forward to use the (possibly virtual) thread pointer
unconditionally throughout musl.

Rich

--X1bOJ3K7DJ5YkBrT
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="mips_rdhwr.c"

#include <stdio.h>
#include <time.h>

int main()
{
	struct timespec t0, t;
	unsigned i, x=0;
	clock_gettime(CLOCK_REALTIME, &t0);
	for (i=0; i<1000000; i++) {
		/* Pin tp to $3, the register the rdhwr below writes. */
		register void *tp __asm__("$3");
#ifdef DO_RDHWR
		/* Raw encoding of "rdhwr $3,$29". */
		__asm__ __volatile__(".word 0x7c03e83b" : "=r"(tp));
#else
		/* Baseline: same loop with a plain register move. */
		__asm__ __volatile__("move %0,$0" : "=r"(tp));
#endif
		x += (unsigned)tp;
	}
	clock_gettime(CLOCK_REALTIME, &t);
	/* Elapsed time, borrowing a second if nanoseconds went negative. */
	t.tv_sec -= t0.tv_sec;
	if ((t.tv_nsec -= t0.tv_nsec) < 0) {
		t.tv_nsec += 1000000000;
		t.tv_sec--;
	}
	printf("%u %lld.%.9ld\n", x, (long long)t.tv_sec, t.tv_nsec);
}

--X1bOJ3K7DJ5YkBrT--
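
As a rough illustration of option 2 above, the check could be tucked
into the accessor along the following lines. This is only a sketch
under stated assumptions, not musl's actual code: every name here
(sketch_pthread, sketch_threaded, sketch_main_thread, sketch_hw_tp,
sketch_read_hw_tp, sketch_pthread_self) is hypothetical, and the
hardware thread-pointer read is simulated with an ordinary variable
plus a counter so the sketch builds on any host rather than only on
MIPS.

```c
/* Hypothetical stand-in for the per-thread structure. */
struct sketch_pthread {
	int foo;
};

/* Hypothetical globals: set once a second thread exists, and the
 * main thread's structure as recorded at startup. */
static int sketch_threaded;
static struct sketch_pthread *sketch_main_thread;

/* Simulated hardware thread pointer. On old MIPS the real read would
 * be the trapping rdhwr; here it is a plain variable, and the counter
 * records how often the "expensive" path is taken. */
static struct sketch_pthread *sketch_hw_tp;
static unsigned long sketch_hw_reads;

static struct sketch_pthread *sketch_read_hw_tp(void)
{
	sketch_hw_reads++;
	return sketch_hw_tp;
}

/* The accessor callers would use: single-threaded processes never
 * touch the hardware path, so callers need no predicate of their own. */
static struct sketch_pthread *sketch_pthread_self(void)
{
	if (!sketch_threaded)
		return sketch_main_thread;
	return sketch_read_hw_tp();
}
```

With something of this shape, a caller would just write
sketch_pthread_self()->foo unconditionally; the single-threaded
shortcut stays in one place and costs one global load and a branch.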