From mboxrd@z Thu Jan 1 00:00:00 1970
From: Linus Torvalds
To: Geoff Collyer
Cc: 9fans@cse.psu.edu
Subject: Re: [9fans] Re: Threads: Sewing badges of honor onto a Kernel
In-Reply-To:
Message-ID:
References:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Date: Fri, 27 Feb 2004 01:30:57 -0800
Topicbox-Message-UUID: fecfa91c-eacc-11e9-9e20-41e7f4b1d025

On Fri, 27 Feb 2004, Geoff Collyer wrote:
>
> I think we're talking past each other due to different terminologies.
> Linus seems to use `thread' to mean `a process sharing address space
> (other than normal text segment sharing)', whereas in Plan 9, that's
> just a process; some share address space, some don't. A Plan 9 thread
> is entirely a user-mode creation of the Plan 9 thread library, which
> doesn't implement POSIX threads.

Well, what Linux has is really what I privately call a "context of
execution". The "clone()" system call in Linux just creates a new such
"context of execution", and you can choose to arbitrarily share pretty
much any OS state, by just saying which state you want to share in a
bitmap.

In addition to the bitmap there are a few pointers you pass around, the
full required state is actually

    clone_flags:    bitmap of how to create the new context of execution
    newsp:          stack pointer of new context
    parent tidptr:  pointer (in the parent) to the thread ID information
    child tidptr:   pointer (in the child) to the thread ID information
    tls pointer:    pointer to TLS (thread-local-storage) for the context

but not all of them are necessarily used (ie if you don't want to set
TLS or TID information, those pointers are obviously unused).

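A rough sketch of how those arguments appear from user space, assuming
glibc's clone() wrapper on Linux rather than the raw system call. The
wrapper adds a function pointer and an argument for the child, and takes
the optional pointers in the order parent tidptr, tls, child tidptr. The
names child_fn, parent_tid_matches and STACK_SIZE are made up for
illustration. Only the parent tidptr is used here (requested with the
CLONE_PARENT_SETTID bit), so the other two pointers stay NULL:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

#define STACK_SIZE (256 * 1024)   /* arbitrary size for the child's stack */

/* The new context of execution starts here and exits right away. */
static int child_fn(void *arg)
{
    (void)arg;
    return 0;
}

/*
 * Hypothetical helper: returns 1 if the kernel stored the child's TID
 * through the parent tidptr, 0 if it did not, -1 on error.
 */
int parent_tid_matches(void)
{
    char *stack = malloc(STACK_SIZE);
    if (stack == NULL)
        return -1;

    pid_t parent_tid = -1;
    pid_t pid = clone(child_fn,
                      stack + STACK_SIZE,             /* newsp: top of the stack */
                      CLONE_PARENT_SETTID | SIGCHLD,  /* clone_flags */
                      NULL,                           /* argument for child_fn */
                      &parent_tid,                    /* parent tidptr: used */
                      NULL,                           /* tls pointer: unused */
                      NULL);                          /* child tidptr: unused */
    if (pid == -1) {
        free(stack);
        return -1;
    }

    int status;
    pid_t reaped = waitpid(pid, &status, 0);
    free(stack);
    if (reaped != pid)
        return -1;
    assert(WIFEXITED(status));   /* child_fn returned normally */

    /* Without CLONE_THREAD the child's TID equals the pid clone() returned. */
    return parent_tid == pid;
}
```

Called from an ordinary process, parent_tid_matches() should return 1,
since the store through the parent pointer completes before clone()
returns to the caller.
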
The bits you can control the context copy with are:

    CSIGNAL                 /* signal mask to be sent at exit */
    CLONE_VM                /* set if VM shared between processes */
    CLONE_FS                /* set if fs info shared between processes */
    CLONE_FILES             /* set if open files shared between processes */
    CLONE_SIGHAND           /* set if signal handlers and blocked signals shared */
    CLONE_IDLETASK          /* set if new pid should be 0 (kernel only)*/
    CLONE_PTRACE            /* set if we want to let tracing continue on the child too */
    CLONE_VFORK             /* set if the parent wants the child to wake it up on mm_release */
    CLONE_PARENT            /* set if we want to have the same parent as the cloner */
    CLONE_THREAD            /* Same thread group? */
    CLONE_NEWNS             /* New namespace group? */
    CLONE_SYSVSEM           /* share system V SEM_UNDO semantics */
    CLONE_SETTLS            /* create a new TLS for the child */
    CLONE_PARENT_SETTID     /* set the TID in the parent */
    CLONE_CHILD_CLEARTID    /* clear the TID in the child */
    CLONE_DETACHED          /* Unused, ignored */
    CLONE_UNTRACED          /* set if the tracing process can't force CLONE_PTRACE on this clone */
    CLONE_CHILD_SETTID      /* set the TID in the child */
    CLONE_STOPPED           /* Start in stopped state */

(CSIGNAL isn't a bit - it's the low 8 bits, and it specifies the signal
you want to send to your parent when you die).

So a "fork()" is literally really just a "clone(SIGCHLD)". We're saying
that we don't want to share anything, and that we want to send a SIGCHLD
at exit.

Setting the CLONE_VM bit says that the VM gets shared. That means that
instead of copying the page tables, we just copy the pointer to the
"struct mm_struct", which describes everything in the VM, and we
increment its reference count.

There is no "partial copy". If you say that you want to share the VM,
you get the WHOLE VM. Or you will get a totally private VM.

Similarly, if you say that you want to share the file descriptors
(CLONE_FILES), they will all be shared: one context doing an "open()"
will have that fd be valid in all other contexts that share it.

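To make the fork()-as-clone(SIGCHLD) and CLONE_VM points above concrete,
here is a minimal user-space sketch, assuming Linux with glibc's clone()
wrapper (which takes a function and a stack for the child, unlike the
raw system call). The names counter, child_fn, counter_after_clone and
STACK_SIZE are hypothetical. The child writes to a global, and the
parent then checks whether it can see that write:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

#define STACK_SIZE (256 * 1024)   /* arbitrary size for the child's stack */

/* Lives in the VM, which the two contexts may or may not share. */
static volatile int counter;

static int child_fn(void *arg)
{
    (void)arg;
    counter = 42;   /* lands in the parent's memory only if the VM is shared */
    return 0;
}

/*
 * Hypothetical helper: clone a child with the given flags plus SIGCHLD
 * as the exit signal, wait for it, and return the value of counter as
 * the parent sees it afterwards (-1 on error).
 */
int counter_after_clone(int flags)
{
    char *stack = malloc(STACK_SIZE);
    if (stack == NULL)
        return -1;

    counter = 0;
    pid_t pid = clone(child_fn, stack + STACK_SIZE, flags | SIGCHLD, NULL);
    if (pid == -1) {
        free(stack);
        return -1;
    }

    int status;
    pid_t reaped = waitpid(pid, &status, 0);
    free(stack);
    if (reaped != pid)
        return -1;
    assert(WIFEXITED(status));   /* child_fn returned normally */

    return counter;
}
```

With no sharing bits, counter_after_clone(0) behaves like fork() and
returns 0, because the child wrote to its own private copy of the VM.
counter_after_clone(CLONE_VM) returns 42, because both contexts point at
the same VM.
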
(The difference between CLONE_FILES and a regular fork() is that a
CLONE_FILES will increment just _one_ reference count: the reference
count for the whole array of pointers to files. In contrast, a fork-like
non-shared case will create a whole new array of pointers to files, and
then for each file increment the reference count for that file).

What most "unix people" call threads is something that is created with
pretty much all flags set - we share pretty much everything except for
the register state and the kernel stack between the contexts. And when I
say "share", I really mean share: most of the bits end up being copying
a kernel pointer and incrementing the reference count for that object.

Some of the bits are "administrative": the VFORK bit isn't about
sharing, it's about the parent waiting until the child releases the VM
back to it (btw, that uses a "completion" structure on the parent's
stack). Similarly, the SETTID/CLEARTID bits are about writing the TID
("thread ID" as opposed to "process ID") to the VM space atomically with
the creation (or in the case of CLEARTID, teardown) of the thread. That
ends up helping the thread management (from user space) a _lot_.

(Tangential to this discussion is the TLS or "thread local storage" bit
- some architecture-specific way of indicating a small thread-specific
storage area. It's not the stack, it's just a regular allocation, and
different architectures have different ways of pointing to it. Usually
there's some architected register set aside for it).

And you can mix and match things. You can literally create a new context
that shares the file descriptors (so that one process doing an "open()"
will open files in the other one), but doesn't share the VM space.

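The shared-descriptors-but-private-VM combination can be sketched like
this, again assuming Linux with glibc's clone() wrapper. The names
open_in_child, child_fd_visible_in_parent and STACK_SIZE are
hypothetical. Since the VM is not shared, the child reports the
descriptor number through its exit status, which is a shortcut that only
works for small descriptor numbers:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (256 * 1024)   /* arbitrary size for the child's stack */

/* Child: open a file and report the fd number via the exit status. */
static int open_in_child(void *arg)
{
    (void)arg;
    int fd = open("/dev/null", O_RDONLY);
    return (fd >= 0 && fd < 255) ? fd : 255;   /* 255 means failure */
}

/*
 * Hypothetical helper: returns 1 if the fd opened by the child is valid
 * in the parent, 0 if it is not, -1 on error.
 */
int child_fd_visible_in_parent(int flags)
{
    char *stack = malloc(STACK_SIZE);
    if (stack == NULL)
        return -1;

    /* CLONE_VM is deliberately never added here: the VM stays private. */
    pid_t pid = clone(open_in_child, stack + STACK_SIZE, flags | SIGCHLD, NULL);
    if (pid == -1) {
        free(stack);
        return -1;
    }

    int status;
    pid_t reaped = waitpid(pid, &status, 0);
    free(stack);
    if (reaped != pid || !WIFEXITED(status) || WEXITSTATUS(status) == 255)
        return -1;

    int fd = WEXITSTATUS(status);
    if (fcntl(fd, F_GETFD) == -1)
        return 0;    /* the fd only ever existed in the child's own table */

    close(fd);       /* shared table: the child's open() is ours too */
    return 1;
}
```

child_fd_visible_in_parent(0) is the fork-like case and returns 0, while
child_fd_visible_in_parent(CLONE_FILES) returns 1 even though the two
contexts never shared any memory.
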
Although some of them are interdependent - CLONE_THREAD (which is really
just "all signal state" despite the name - it has nothing to do with VM
per se) depends on CLONE_SIGHAND (which is just the set of signal
handlers associated with the context), which in turn depends on CLONE_VM
(because it doesn't make sense to be able to take a signal in different
contexts unless they share the same VM).

This has gotten fairly far off the notion of stacks and VM..

But I hope it's clear to everybody that I heartily agree with rfork-like
functionality. It's just segmented/private stacks I can't understand.

        Linus