From mboxrd@z Thu Jan  1 00:00:00 1970
From: Linus Torvalds <torvalds@osdl.org>
To: Rob Pike <rob@mightycheese.com>
Cc: 9fans@cse.psu.edu
Subject: Re: [9fans] Re: Threads: Sewing badges of honor onto a Kernel
In-Reply-To: <98DCC07A-6939-11D8-B851-000A95B984D8@mightycheese.com>
Message-ID: <Pine.LNX.4.58.0402270814250.2563@ppc970.osdl.org>
References: <95d43a901ffade2982d28db8d39180b2@collyer.net>
 <98DCC07A-6939-11D8-B851-000A95B984D8@mightycheese.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Date: Fri, 27 Feb 2004 08:57:46 -0800
Topicbox-Message-UUID: 003250fc-eacd-11e9-9e20-41e7f4b1d025



On Fri, 27 Feb 2004, Rob Pike wrote:
>
>  > Having a pointer that sometimes works, and sometimes doesn't, based  on who
>  > uses it - that's just crazy talk.
>
> put in those terms, it sounds weird, but it's not.  consider the old u.
> area in unix. that was a piece of address space with the same virtual
> address in all processes but a different content.

I hate to put it to you, Rob, but that sucks. It sucks so hard that it's
not even funny.

Playing tricks with the VM is a cool hack, and we've considered it
multiple times, but in the end sanity has always prevailed. Playing VM
tricks is _expensive_. It looks cheap because the access code "just
works", but the expense is in

 - wasting TLB entries (all sane CPUs have big "fixed" areas that are
   better on the TLB for system mode - large pages, BAT registers, or just
   architected 1:1 mappings). The kernel should use them, because the
   fewer TLB entries the kernel uses, the more there are for user space to
   waste. But that implies that the kernel shouldn't play any VM tricks
   with its own data - the VM is for user space (yeah, the kernel ends up
   wanting to use the VM hardware for some things, but it's discouraged)

 - CPUs with hardware TLB lookup have to have separate page tables for
   different CPUs. Which means not only that you're wasting memory on
   having <n> copies of the thing, it also means that now you have to have
   code to maintain coherency between those separate page tables, and have
   to have locking.

   Having just one copy is just better. Two copies of the same thing is
   bad.

 - TLB invalidates are just a lot more expensive than changing a register
   around. It's bad even on "good" hardware, it's totally unacceptable on
   anything with a virtual cache, for example.

So the kernel internally has always had the stack pointer register as the
main "thread pointer". It did that even before SMP, just because it was
simpler and faster. With SMP, doing anything else becomes unacceptable.
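
A user-space sketch of the stack-pointer idea, not the kernel's actual code: it assumes each thread runs on a stack that is aligned to its own power-of-two size, with a small per-thread struct placed at the lowest address, so that masking any address on the stack yields the struct. The names STACK_SIZE, thread_info, current_info and stack_mask_demo are hypothetical, and the address of a local variable stands in for the stack pointer register.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

#define STACK_SIZE (1UL << 20)   /* hypothetical: 1 MiB, must be a power of two */

struct thread_info {             /* hypothetical per-thread data at the stack base */
    int id;
};

/* Derive the per-thread pointer from an address on the current stack. */
static struct thread_info *current_info(void)
{
    volatile char marker = 0;                 /* lives on the current stack */
    uintptr_t sp = (uintptr_t)&marker;        /* stand-in for the stack pointer */
    return (struct thread_info *)(sp & ~((uintptr_t)STACK_SIZE - 1));
}

static void *worker(void *unused)
{
    (void)unused;
    /* No argument, no global, no system call: just mask the stack address. */
    return (void *)(intptr_t)current_info()->id;
}

/* Runs one thread on an aligned stack; returns the id it found, or -1 on error. */
int stack_mask_demo(int id)
{
    void *stack = NULL;
    void *ret = NULL;
    pthread_attr_t attr;
    pthread_t t;

    if (posix_memalign(&stack, STACK_SIZE, STACK_SIZE) != 0)
        return -1;
    assert(((uintptr_t)stack & (STACK_SIZE - 1)) == 0);
    ((struct thread_info *)stack)->id = id;   /* lowest address of the stack */

    pthread_attr_init(&attr);
    pthread_attr_setstack(&attr, stack, STACK_SIZE);
    if (pthread_create(&t, &attr, worker, NULL) != 0) {
        pthread_attr_destroy(&attr);
        free(stack);
        return -1;
    }
    pthread_join(t, &ret);
    pthread_attr_destroy(&attr);
    free(stack);
    return (int)(intptr_t)ret;
}
```

Calling stack_mask_demo(7) should hand back 7. The lookup costs one AND on a value that is already in a register, and it touches no page tables at all.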

>  the system used the fact that the addresses aliased that way.  plan 9's
> thread model does a similar thing by constructing a special storage
> class for data private to each process.  for instance, one can have a
> variable with the same address in each process, call it tp, that points
> to the thread-specific data, so you can write code like
>
> 	printf("my id is %d\n", tp->id);

Yes. And you pay the price. For no good reason, I may add, since you
traditionally have been able to do the same by just having to add some
explicit thread tracking (it wouldn't be "tp->id", it would be
"mythread()->tp->id") or by adding compiler support to make the syntax be
easier.

These days, if you want to avoid the syntax of carrying that per-thread
pointer around, we have compiler support, ie you can do

	__thread struct mystruct *tp = NULL;

and now "tp" is a per-thread variable. Behind the scenes the compiler
will change it to be an offset off the TLS pointer, pretty much exactly
the same way it does position-independent code.
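
A minimal sketch of that declaration in use, assuming GCC or Clang (for the __thread extension) and pthreads. The struct layout and the names worker_arg, worker and tls_demo are made up for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct mystruct {
    int id;
};

/* One copy of this pointer exists per thread. */
static __thread struct mystruct *tp = NULL;

struct worker_arg {      /* hypothetical: input id, and what the thread saw */
    int id;
    int seen;
};

static void *worker(void *p)
{
    struct worker_arg *arg = p;
    struct mystruct mine = { .id = arg->id };

    assert(tp == NULL);  /* a new thread starts with the initial value */
    tp = &mine;          /* sets this thread's copy only */
    arg->seen = tp->id;  /* same source-level name, per-thread contents */
    return NULL;
}

/* Returns 1 if each thread saw its own id and the caller's tp is untouched. */
int tls_demo(void)
{
    struct worker_arg a = { 1, 0 }, b = { 2, 0 };
    pthread_t ta, tb;

    if (pthread_create(&ta, NULL, worker, &a) != 0)
        return -1;
    if (pthread_create(&tb, NULL, worker, &b) != 0) {
        pthread_join(ta, NULL);
        return -1;
    }
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    return a.seen == 1 && b.seen == 2 && tp == NULL;
}
```

Both workers write "tp" without any locking, because each one is writing a different object.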

>  > The same way it creates any other storage: with mmap() and brk(). You just
>  > malloc the thing, and you pass in the new stack as an argument to the
>  > thread creation mechanism (which linux calls "clone()", just to be
>  > different).
>
> that wasn't what i was asking.  i was referring to this special storage
> class. how does a thread identify who it is?

A long time ago, it was literally a "gettid()" system call. If you wanted
the thread-local space, you followed that by an index lookup.

It's not insanely expensive if you avoid re-generating the thread-local
pointer all the time, and pass it down as a regular argument, but it is
obviously syntactically not pretty.
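
A sketch of that older, explicit style, assuming Linux (for the gettid system call, reached here through syscall()) and pthreads. The registry table and the names register_thread, mythread, report_id and explicit_demo are hypothetical; a real thread library would organize the lookup differently.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#define MAX_THREADS 16                  /* hypothetical fixed-size registry */

struct thread_data {
    pid_t tid;                          /* 0 means the slot is free */
    int id;
};

static struct thread_data registry[MAX_THREADS];
static pthread_mutex_t registry_lock = PTHREAD_MUTEX_INITIALIZER;

static pid_t my_tid(void)
{
    return (pid_t)syscall(SYS_gettid);
}

/* Claim a free slot for the calling thread. */
static struct thread_data *register_thread(int id)
{
    struct thread_data *td = NULL;

    pthread_mutex_lock(&registry_lock);
    for (int i = 0; i < MAX_THREADS && td == NULL; i++) {
        if (registry[i].tid == 0) {
            td = &registry[i];
            td->tid = my_tid();
            td->id = id;
        }
    }
    pthread_mutex_unlock(&registry_lock);
    return td;
}

/* The slow path: one system call followed by a table lookup. */
static struct thread_data *mythread(void)
{
    pid_t tid = my_tid();
    struct thread_data *td = NULL;

    pthread_mutex_lock(&registry_lock);
    for (int i = 0; i < MAX_THREADS && td == NULL; i++)
        if (registry[i].tid == tid)
            td = &registry[i];
    pthread_mutex_unlock(&registry_lock);
    return td;
}

/* Callees take the pointer as a regular argument instead of looking it up. */
static int report_id(const struct thread_data *self)
{
    return self->id;
}

static void *worker(void *p)
{
    int *io = p;                        /* in: id to register, out: id reported */
    struct thread_data *self;

    if (register_thread(*io) == NULL)
        return NULL;
    self = mythread();                  /* looked up once ... */
    assert(self != NULL);
    *io = report_id(self) + 1000;       /* ... then passed down explicitly */

    pthread_mutex_lock(&registry_lock);
    self->tid = 0;                      /* release the slot before exiting */
    pthread_mutex_unlock(&registry_lock);
    return NULL;
}

/* Returns 1 if both threads found their own data. */
int explicit_demo(void)
{
    int a = 1, b = 2;
    pthread_t ta, tb;

    if (pthread_create(&ta, NULL, worker, &a) != 0)
        return -1;
    if (pthread_create(&tb, NULL, worker, &b) != 0) {
        pthread_join(ta, NULL);
        return -1;
    }
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    return a == 1001 && b == 1002;
}
```

The cost sits in mythread(); report_id() is an ordinary function taking an ordinary pointer.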

These days - mostly thanks to compiler and library advances, not so much
any real kernel changes - the thread infrastructure sets up its local
pointers in registers, so that you can use the above "__thread"
specifier in the compiler, and when you access

	tp->id

the compiler will actually generate

	movl    %gs:tp@NTPOFF, %eax
	movl    (%eax), %eax

for you (on other platforms that have compiler support of thread-local
storage it usually would end up being an indirect access through a regular
register).

The linker fixes these things up, the same way it does things like GOT
tables etc.

> ah, i see in later mail that you answered this. there are now pointers
> created in the user space (i think) to thread-local storage.  how is it
> accessed, that is, how does the user process derive the pointer to it?
> this state stuff did not exist when we did the inferno port.

See above. If you control your environment (ie you don't have to worry
about having arbitrary TLS space), you can do better with the stack
register trick the kernel uses, but indirection will handle the general
case.

> it will work; it's the magic address bits hack.  which kernel version
> introduced this stuff?  i've heard people say that 2.6 is the first one
> with the default thread model being 'efficient' and 'good', but i don't
> know the specifics.  i've also heard that they can be retrofitted to
> 2.4.

The new threading model in 2.6.x is really more about signal handling than
anything else. The _real_ problem with the original clone() implementation
had nothing to do with the VM, and had everything to do with insane POSIX
shared signal semantics. It's really hard to get the POSIX thread signal
semantics right, since the whole pthreads thing really was designed for
having all threads run within one master process, and Linux never had the
notion of "process vs threads".

The signal case that is hard to get right in POSIX is the fact that signal
masks are thread-local, yet their effect is "process global" (ie when you
change the signal mask of your thread, that means that you suddenly now
potentially start accepting pending signals that were shared process
global). I still don't like that POSIX model, and I didn't see any sane
way to do it efficiently with truly independent threads that don't have
the notion of a "process" that encompasses them.

What 2.6.x (and the 2.4.x back-port) does is to just accept the fact that
there is a "thread group leader" (that's what the CLONE_THREAD flag does:
if it is set, you share the thread group leader, if it is clear you create
a new thread group), and that pending signal state really is shared in the
thread group.

The VM side has always been the same: if you share the VM, you share
everything. There literally isn't any thread-local storage from a VM
standpoint, there are only thread-local registers that point to different
areas of memory.

> it's interesting you advocate using registers for the magic storage
> class. it's a great trick when you can do it - plan 9 uses it in the
> kernel on machines with lots of registers - but it's not so great on a
> machine with too few registers, like the x86.

Well, even in the absence of a register, you can always just have a system
call to ask what the pointer should be. That really does work very well,
as long as your programming model is about explicit thread pointers (which
pthreads is) so that you don't have to do it all the time.

And the x86 really is the worst possible case, in this situation, because
it is so register-starved anyway. But happily, it has some (very ugly)
legacy registers that have to be user-visible, and have to be saved and
restored anyway, and that nobody sane really wants to use, so the thread
model can use them.

Making the threaded stuff explicit helps avoid confusion. Now, if somebody
takes an address of a per-thread variable, it is clear that that address
is the address of the variable IN THAT THREAD. You can pass it along, but
when you pass it along to another thread, it doesn't change value - it
still points to the exact same thread-local variable in the _original_
thread.
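
A small sketch of that property, assuming GCC or Clang with pthreads. The names counter, bump_through_pointer and address_demo are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

static __thread int counter;            /* one instance per thread */

static void *bump_through_pointer(void *p)
{
    int *creator_counter = p;

    /* Same name, but this thread's instance lives at a different address. */
    assert(creator_counter != &counter);

    counter = 100;                      /* this thread's own copy */
    *creator_counter += 1;              /* the creating thread's copy */
    return NULL;
}

/* Returns the caller's counter after the other thread wrote through the pointer. */
int address_demo(void)
{
    pthread_t t;

    counter = 41;
    if (pthread_create(&t, NULL, bump_through_pointer, &counter) != 0)
        return -1;
    pthread_join(t, NULL);
    return counter;                     /* 42: the pointer kept its meaning */
}
```

The pointer is only good for as long as the thread that owns the variable is still around, which is why the creator joins before it reads the result.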

(Obviously you can pass around offsets to the thread-local space if you
want to, although I can't really see why you'd do it).

And I hope it's clear by now that because the thing is entirely in
registers, that "thread model" is pretty much all in user space. It needs
no kernel support, although some of the interfaces are obviously done
certain ways to make it easier to do (ie the kernel does know about a TLS
pointer at thread setup, even if the kernel doesn't actually _use_ it for
anything, it just sets up the register state as indicated by the parent of
the thread).

			Linus