From: Linus Torvalds
To: Scott Schwartz , Rob Pike
Cc: 9fans@cse.psu.edu
Subject: Re: [9fans] Re: Threads: Sewing badges of honor onto a Kernel
In-Reply-To: <20040227081500.14366.qmail@g.galapagos.bx.psu.edu>
Message-ID: 
References: <20040227081500.14366.qmail@g.galapagos.bx.psu.edu>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Date: Fri, 27 Feb 2004 00:58:30 -0800

On Fri, 27 Feb 2004, Rob Pike wrote:
>
> > So in a C/UNIX-like environment, private stacks are wrong. You could
> > imagine _other_ environments where they might be valid, but even those
> > other environments would not invalidate my points about efficiency and
> > simplicity.
>
> as i said before, the stacks are not private. you're right, that's a
> bad thing. that's why they're not private.
>
> the segment called 'stack' is private, but that's a different thing.
> i stress: stack != stack segment. stack is where your sp is; stack
> segment is a figment of the VM system.

Well, in another email I already said that "private stack" and "segments"
are really exactly the same thing - some people think of segments as
paging things, others think of them in the x86 sense, but in the end it
all comes down to the fact that a "stack address" ends up making sense
only within a specific context (and that context can sometimes be
partially visible to other threads by using explicit segment registers or
other magic, like special instructions that can take another address
space).

And private/segmented stacks are bad. They are bad exactly because they
magically make automatic variables fundamentally different from other
variables, and there is really no reason they should be different. There
is absolutely nothing wrong with having a thread take the address of some
automatic variable, and then just pass that address off to another
routine.
And if that other routine decides that it is going to create a hundred
threads to solve the problem that the variable describes in parallel,
then that should JUST WORK. Anything else would be EVIL. Having a pointer
that sometimes works, and sometimes doesn't, based on who uses it -
that's just crazy talk.

> i ask again: how does linux create per-thread storage?

The same way it creates any other storage: with mmap() and brk(). You
just malloc the thing, and you pass in the new stack as an argument to
the thread creation mechanism (which Linux calls "clone()", just to be
different).

And because that storage is just storage, things like the one I described
above "just work". If you pass another thread a pointer to your stack,
the other thread can happily manipulate it, and never even needs to know
that it's an automatic variable somewhere else.

And sure, you can shoot yourself in the foot that way. You can pass off a
pointer to another thread, and then return from the function without
synchronizing with that other thread properly, and now the other thread
will scribble all over your stack. But that's really no different from
using "alloca()" and passing the result to something that remembers the
address.

> the way the plan 9 thread library works is so different from linux's
> that they're hard to compare. program design in the two worlds is
> radically different. so your claim of 'better' is curious to me. by
> 'better' you seem to mean 'faster' and 'cleaner'. faster at least can
> be measured.

To me, the final decision on "better" tends to be a fairly wide issue.
Performance is part of it - infrastructure that everybody depends on
should always strive to at least _allow_ good performance, even if not
everybody ends up caring.

But the concept, to me, is more important. Basically, I do not see how
you can have a portable and even _remotely_ efficient partial VM sharing.
And if you can't have it, then you shouldn't design the interfaces around
it.
> you speak with certainty. have you seen performance comparisons? i
> haven't, although it wouldn't surprise me to learn that there are useful
> programs for which linux outperforms plan 9, and vice versa of course.

When it comes to threads, I only see three interesting performance
metrics: how fast you can create them, how fast you can synchronize them
(both "join" and locking), and how well you switch between them.

The locking is pretty much OS-independent, since fast locking has to be
done in user space anyway (with just the contention case falling back to
the OS - and if your app cares about performance, it hopefully won't have
much contention).

So we're left with create, tear-down and switch. All of which are
_fundamentally_ faster if you just have a "share everything" model.
Create and tear-down are just an increment/decrement of a reference
counter (there's a spinlock involved too). A task switch is a no-op from
a VM standpoint (except we have a per-thread lazy TLB invalidate that
will trigger).

In contrast, partial sharing is a major pain. You definitely don't just
do a reference-count increment for your VM.

		Linus