From mboxrd@z Thu Jan  1 00:00:00 1970
From: Linus Torvalds
To: Rob Pike
Cc: 9fans@cse.psu.edu
Subject: Re: [9fans] Re: Threads: Sewing badges of honor onto a Kernel
In-Reply-To: <98DCC07A-6939-11D8-B851-000A95B984D8@mightycheese.com>
Message-ID:
References: <95d43a901ffade2982d28db8d39180b2@collyer.net> <98DCC07A-6939-11D8-B851-000A95B984D8@mightycheese.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Date: Fri, 27 Feb 2004 08:57:46 -0800
Topicbox-Message-UUID: 003250fc-eacd-11e9-9e20-41e7f4b1d025

On Fri, 27 Feb 2004, Rob Pike wrote:
>
> > Having a pointer that sometimes works, and sometimes doesn't, based on who
> > uses it - that's just crazy talk.
>
> put in those terms, it sounds weird, but it's not.  consider the old u.
> area in unix.  that was a piece of address space with the same virtual
> address in all processes but a different content.

I hate to put it to you, Rob, but that sucks. It sucks so hard that it's
not even funny.

Playing tricks with the VM is a cool hack, and we've considered it
multiple times, but in the end sanity has always prevailed. Playing VM
tricks is _expensive_. It looks cheap because the access code "just
works", but the expense is in

 - wasting TLB entries (all sane CPU's have big "fixed" areas that are
   better on the TLB for system mode - large pages, BAT registers, or just
   architected 1:1 mappings). The kernel should use them, because the
   fewer TLB entries the kernel uses, the more there are for user space to
   waste. But that implies that the kernel shouldn't play any VM tricks
   with its own data - the VM is for user space (yeah, the kernel ends up
   wanting to use the VM hardware for some things, but it's discouraged)

 - CPU's with hardware TLB lookup have to have separate page tables for
   different CPU's. Which means not only that you're wasting memory on
   having copies of the thing, it also means that now you have to have
   code to maintain coherency between those separate page tables, and have
   to have locking.
   Having just one copy is just better. Two copies of the same thing is
   bad.

 - TLB invalidates are just a lot more expensive than changing a register
   around. It's bad even on "good" hardware, it's totally unacceptable on
   anything with a virtual cache, for example.

So the kernel internally has always had the stack pointer register as the
main "thread pointer". It did that even before SMP, just because it was
simpler and faster. With SMP, doing anything else becomes unacceptable.

> the system used the fact that the addresses aliased that way.  plan 9's
> thread model does a similar thing by constructing a special storage
> class for data private to each process.  for instance, one can have a
> variable with the same address in each process, call it tp, that points
> to the thread-specific data, so you can write code like
>
> 	printf("my id is %d\n", tp->id);

Yes. And you pay the price. For no good reason, I may add, since you
traditionally have been able to do the same by just having to add some
explicit thread tracking (it wouldn't be "tp->id", it would be
"mythread()->tp->id") or by adding compiler support to make the syntax
be easier.

These days, if you want to avoid the syntax of carrying that per-thread
pointer around, we have compiler support, ie you can do

	__thread struct mystruct *tp = NULL;

and now "tp" is a per-thread variable. Behind the scenes the compiler
will change it to be an offset off the TLS pointer, pretty much exactly
the same way it does position-independent code.

> > The same way it creates any other storage: with mmap() and brk(). You just
> > malloc the thing, and you pass in the new stack as an argument to the
> > thread creation mechanism (which linux calls "clone()", just to be
> > different).
>
> that wasn't what i was asking.  i was referring to this special storage
> class.  how does a thread identify who it is?

A long time ago, it was literally a "gettid()" system call.
If you wanted the thread-local space, you followed that by an index
lookup. It's not insanely expensive if you avoid re-generating the
thread-local pointer all the time, and pass it down as a regular
argument, but it is obviously syntactically not pretty.

These days - mostly thanks to compiler and library advances, not so much
any real kernel changes - the thread infrastructure sets up its local
pointers in registers, so that you can use the above "__thread" specifier
in the compiler, and when you access tp->id the compiler will actually
generate

	movl %gs:tp@NTPOFF, %eax
	movl (%eax), %eax

for you (on other platforms that have compiler support of thread-local
storage it usually would end up being an indirect access through a
regular register). The linker fixes these things up, the same way it does
things like GOT tables etc.

> ah, i see in later mail that you answered this.  there are now pointers
> created in the user space (i think) to thread-local storage.  how is it
> accessed, that is, how does the user process derive the pointer to it?
> this state stuff did not exist when we did the inferno port.

See above. If you control your environment (ie you don't have to worry
about having arbitrary TLS space), you can do better with the stack
register trick the kernel uses, but indirection will handle the general
case.

> it will work; it's the magic address bits hack.  which kernel version
> introduced this stuff?  i've heard people say that 2.6 is the first one
> with the default thread model being 'efficient' and 'good', but i don't
> know the specifics.  i've also heard that they can be retrofitted to
> 2.4.

The new threading model in 2.6.x is really more about signal handling
than anything else. The _real_ problem with the original clone()
implementation had nothing to do with the VM, and had everything to do
with insane POSIX shared signal semantics.

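
To make "shared signal semantics" concrete, here is a minimal sketch and
not anything taken from the kernel or from a pthreads implementation. It
assumes a POSIX system with pthreads and a process that has no other
threads running. The names shared_pending_demo() and wait_for_usr1() are
made up for illustration. A signal sent to the process while it is
blocked stays pending for the process as a whole, and a thread created
afterwards can consume it.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <signal.h>
#include <unistd.h>

/* Hypothetical thread start routine: waits for SIGUSR1 and stores the
 * signal number it consumed in the int that arg points to. */
static void *wait_for_usr1(void *arg)
{
    sigset_t set;
    int sig = 0;

    assert(arg != NULL);
    sigemptyset(&set);
    sigaddset(&set, SIGUSR1);
    /* The signal was sent to the process before this thread existed,
     * yet sigwait() can still pick it up from the shared pending set. */
    if (sigwait(&set, &sig) == 0)
        *(int *)arg = sig;
    return NULL;
}

/* Hypothetical demo: returns the signal number the worker consumed,
 * or -1 on error. */
int shared_pending_demo(void)
{
    sigset_t set, old;
    pthread_t worker;
    int received = -1;

    sigemptyset(&set);
    sigaddset(&set, SIGUSR1);
    /* The mask is per-thread; a new thread inherits its creator's mask. */
    if (pthread_sigmask(SIG_BLOCK, &set, &old) != 0)
        return -1;

    /* Process-directed signal: it stays pending for the whole process
     * because the only existing thread has it blocked. */
    if (kill(getpid(), SIGUSR1) != 0)
        return -1;

    if (pthread_create(&worker, NULL, wait_for_usr1, &received) != 0)
        return -1;
    if (pthread_join(worker, NULL) != 0)
        return -1;

    /* Restore the caller's mask only once the signal has been consumed. */
    if (received == SIGUSR1)
        pthread_sigmask(SIG_SETMASK, &old, NULL);
    return received;
}
```

Calling shared_pending_demo() from main() (build with -pthread) should
return SIGUSR1 under the assumptions above: the mask was set per thread,
but the pending signal belonged to the process.
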
It's really hard to get the POSIX thread signal semantics right, since
the whole pthreads thing really was designed for having all threads run
within one master process, and Linux never had the notion of "process vs
threads".

The signal case that is hard to get right in POSIX is the fact that
signal masks are thread-local, yet their effect is "process global" (ie
when you change the signal mask of your thread, that means that you
suddenly now potentially start accepting pending signals that were shared
process global).

I still don't like that POSIX model, and I didn't see any sane way to do
it efficiently with truly independent threads that don't have the notion
of a "process" that encompasses them.

What 2.6.x (and the 2.4.x back-port) does is to just accept the fact that
there is a "thread group leader" (that's what the CLONE_THREAD flag does:
if it is set, you share the thread group leader, if it is clear you
create a new thread group), and that pending signal state really is
shared in the thread group.

The VM side has always been the same: if you share the VM, you share
everything. There literally isn't any thread-local storage from a VM
standpoint, there are only thread-local registers that point to different
areas of memory.

> it's interesting you advocate using registers for the magic storage
> class.  it's a great trick when you can do it - plan 9 uses it in the
> kernel on machines with lots of registers - but it's not so great on a
> machine with too few registers, like the x86.

Well, even in the absence of a register, you can always just have a
system call to ask what the pointer should be. That really does work very
well, as long as your programming model is about explicit thread pointers
(which pthreads is) so that you don't have to do it all the time.

And the x86 really is the worst possible case, in this situation, because
it is so register-starved anyway.
But happily, it has some (very ugly) legacy registers that have to be
user-visible, and have to be saved and restored anyway, and that nobody
sane really wants to use, so the thread model can use them.

Making the threaded stuff explicit helps avoid confusion. Now, if
somebody takes an address of a per-thread variable, it is clear that that
address is the address of the variable IN THAT THREAD. You can pass it
along, but when you pass it along to another thread, it doesn't change
value - it still points to the exact same thread-local variable in the
_original_ thread.

(Obviously you can pass around offsets to the thread-local space if you
want to, although I can't really see why you'd do it).

And I hope it's clear by now that because the thing is entirely in
registers, that "thread model" is pretty much all in user space. It needs
no kernel support, although some of the interfaces are obviously done
certain ways to make it easier to do (ie the kernel does know about a TLS
pointer at thread setup, even if the kernel doesn't actually _use_ it for
anything, it just sets up the register state as indicated by the parent
of the thread).

		Linus
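
The point about addresses of per-thread variables can be checked with a
small sketch. It assumes GCC or Clang (for the "__thread" specifier
mentioned above; C11 spells it _Thread_local) and POSIX threads. The
names my_id, struct probe, worker() and tls_address_demo() are made up
for illustration and are not from any library.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical per-thread variable: every thread gets its own copy,
 * initialized to 0. */
static __thread int my_id = 0;

struct probe {
    int *main_addr;   /* address of my_id taken in the creating thread */
    int seen_own;     /* worker's own my_id before the worker writes it */
    int seen_via_ptr; /* value the worker reads through main_addr */
    int same_address; /* nonzero if &my_id in the worker equals main_addr */
};

static void *worker(void *arg)
{
    struct probe *p = arg;

    assert(p != NULL);
    p->seen_own = my_id;                        /* the worker's own copy */
    p->seen_via_ptr = *p->main_addr;            /* the creator's copy */
    p->same_address = (&my_id == p->main_addr); /* compared while both live */
    my_id = 2;                                  /* touches only the worker's copy */
    return NULL;
}

/* Hypothetical demo: fills *p and returns the calling thread's my_id
 * after the worker has finished, or -1 on error. */
int tls_address_demo(struct probe *p)
{
    pthread_t t;

    my_id = 1;
    p->main_addr = &my_id; /* address of the variable IN THIS THREAD */
    if (pthread_create(&t, NULL, worker, p) != 0)
        return -1;
    if (pthread_join(t, NULL) != 0)
        return -1;
    return my_id;
}
```

Called from main() (build with -pthread), tls_address_demo() returns 1.
The worker sees 0 in its own copy, reads 1 through the pointer it was
handed, and finds that its own &my_id is a different address, so the
pointer kept referring to the original thread's variable.
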