From mboxrd@z Thu Jan  1 00:00:00 1970
From: Linus Torvalds <torvalds@osdl.org>
To: Geoff Collyer <geoff@collyer.net>
Cc: 9fans@cse.psu.edu
Subject: Re: [9fans] Re: Threads: Sewing badges of honor onto a Kernel
In-Reply-To: <a7ec3985c6aeabd43949005aa6af0f67@collyer.net>
Message-ID: <Pine.LNX.4.58.0402270058480.2563@ppc970.osdl.org>
References: <a7ec3985c6aeabd43949005aa6af0f67@collyer.net>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Date: Fri, 27 Feb 2004 01:30:57 -0800
Topicbox-Message-UUID: fecfa91c-eacc-11e9-9e20-41e7f4b1d025



On Fri, 27 Feb 2004, Geoff Collyer wrote:
>
> I think we're talking past each other due to different terminologies.
> Linus seems to use `thread' to mean `a process sharing address space
> (other than normal text segment sharing)', whereas in Plan 9, that's
> just a process; some share address space, some don't.  A Plan 9 thread
> is entirely a user-mode creation of the Plan 9 thread library, which
> doesn't implement POSIX threads.

Well, what Linux has is really what I privately call a "context of
execution".

The "clone()" system call in Linux just creates a new such "context of
execution", and you can choose to arbitrarily share pretty much any OS
state, by just saying which state you want to share in a bitmap. In
addition to the bitmap there are a few pointers you pass around; the full
required state is actually

	clone_flags: bitmap of how to create the new context of execution
	newsp: stack pointer of new context
	parent tidptr: pointer (in the parent) to the thread ID information
	child tidptr: pointer (in the child) to the thread ID information
	tls pointer: pointer to TLS (thread-local-storage) for the context

but not all of them are necessarily used (i.e. if you don't want to set TLS
or TID information, those pointers are obviously unused).

The bits you can control the context copy with are:

  CSIGNAL         	/* signal mask to be sent at exit */

  CLONE_VM        	/* set if VM shared between processes */
  CLONE_FS        	/* set if fs info shared between processes */
  CLONE_FILES     	/* set if open files shared between processes */
  CLONE_SIGHAND   	/* set if signal handlers and blocked signals shared */
  CLONE_IDLETASK  	/* set if new pid should be 0 (kernel only)*/
  CLONE_PTRACE    	/* set if we want to let tracing continue on the child too */
  CLONE_VFORK     	/* set if the parent wants the child to wake it up on mm_release */
  CLONE_PARENT    	/* set if we want to have the same parent as the cloner */
  CLONE_THREAD    	/* Same thread group? */
  CLONE_NEWNS     	/* New namespace group? */
  CLONE_SYSVSEM   	/* share system V SEM_UNDO semantics */
  CLONE_SETTLS    	/* create a new TLS for the child */
  CLONE_PARENT_SETTID   /* set the TID in the parent */
  CLONE_CHILD_CLEARTID  /* clear the TID in the child */
  CLONE_DETACHED        /* Unused, ignored */
  CLONE_UNTRACED        /* set if the tracing process can't force CLONE_PTRACE on this clone */
  CLONE_CHILD_SETTID    /* set the TID in the child */
  CLONE_STOPPED         /* Start in stopped state */

(CSIGNAL isn't a bit - it's the low 8 bits, and it specifies the signal
you want to send to your parent when you die).

So a "fork()" is literally really just a "clone(SIGCHLD)". We're saying
that we don't want to share anything, and that we want to send a SIGCHLD
at exit.
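
A minimal sketch of that equivalence, assuming glibc's clone() wrapper on
Linux (the wrapper takes a function and a stack, which the raw system call
does not need). The names child_main, fork_like_clone and STACK_SIZE are
made up for illustration:

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

#define STACK_SIZE (256 * 1024)	/* arbitrary size for the child's stack */

/* Runs in the new context; its return value becomes the exit status. */
static int child_main(void *arg)
{
	(void)arg;
	return 7;
}

/*
 * Create a child with no sharing bits set and SIGCHLD in the low 8 bits,
 * which is the flag value a fork() corresponds to.
 * Returns the child's exit status, or -1 on error.
 */
int fork_like_clone(void)
{
	char *stack = malloc(STACK_SIZE);
	int status;
	pid_t pid;

	if (stack == NULL)
		return -1;

	/* Pass the top of the buffer: assumes a downward-growing stack. */
	pid = clone(child_main, stack + STACK_SIZE, SIGCHLD, NULL);
	if (pid == -1) {
		free(stack);
		return -1;
	}

	/* The exit signal is SIGCHLD, so an ordinary waitpid() reaps it. */
	if (waitpid(pid, &status, 0) == -1) {
		free(stack);
		return -1;
	}
	free(stack);
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling fork_like_clone() from a test program should hand back the value
child_main returned, the same way a fork()/waitpid() pair would.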

Setting the CLONE_VM bit says that the VM gets shared. That means that
instead of copying the page tables, we just copy the pointer to the
"struct mm_struct", which describes everything in the VM, and we increment
its reference count.
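
One way to see the effect from user space is to let the child store to a
global and check whether the parent observes the store. This is only a
sketch, again assuming glibc's clone() wrapper; writer, shared_word and
value_seen_by_parent are hypothetical names:

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

#define STACK_SIZE (256 * 1024)	/* arbitrary size for the child's stack */

static volatile int shared_word;

/* Child: store to a global, then exit. */
static int writer(void *arg)
{
	(void)arg;
	shared_word = 1;
	return 0;
}

/*
 * Clone a child with extra_flags | SIGCHLD, wait for it, and return what
 * the parent then reads from shared_word (-1 on error).
 */
int value_seen_by_parent(int extra_flags)
{
	char *stack = malloc(STACK_SIZE);
	pid_t pid;

	if (stack == NULL)
		return -1;

	shared_word = 0;
	pid = clone(writer, stack + STACK_SIZE, extra_flags | SIGCHLD, NULL);
	if (pid == -1 || waitpid(pid, NULL, 0) == -1) {
		free(stack);
		return -1;
	}
	free(stack);
	return shared_word;
}
```

With CLONE_VM the parent should read 1, because both contexts use the same
mm; with no sharing bits the child wrote to its own copy and the parent
still reads 0.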

There is no "partial copy". If you say that you want to share the VM, you
get the WHOLE VM. Or you will get a totally private VM. Similarly, if you
say that you want to share the file descriptors (CLONE_FILES), they will
all be shared: one context doing an "open()" will have that fd be valid in
all other contexts that share it.

(The difference between CLONE_FILES and a regular fork() is that a
CLONE_FILES will increment just _one_ reference count: the reference count
for the whole array of pointers to files. In contrast, a fork-like
non-shared case will create a whole new array of pointers to files, and
then for each file increment the reference count for that file).

What most "unix people" call threads is something that is created with
pretty much all flags set - we share pretty much everything except for the
register state and the kernel stack between the contexts. And when I say
"share", I really mean share: most of the bits end up being copying a
kernel pointer and incrementing the reference count for that object.

Some of the bits are "administrative": the VFORK bit isn't about sharing,
it's about the parent waiting until the child releases the VM back to it
(btw, that uses a "completion" structure on the parent's stack). Similarly,
the SETTID/CLEARTID bits are about writing the TID ("thread ID" as opposed
to "process ID") to the VM space atomically with the creation (or in the
case of CLEARTID, teardown) of the thread. That ends up helping the thread
management (from user space) a _lot_.
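
A rough sketch of the TID words in action, assuming glibc's clone() wrapper,
whose optional trailing arguments are (parent_tid, tls, child_tid) in that
order. CLONE_VM is added here only so the parent can watch the child's word;
idle_child, clone_with_tid_words and the two static words are hypothetical
names:

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

#define STACK_SIZE (256 * 1024)	/* arbitrary size for the child's stack */

static pid_t parent_word;	/* filled in by CLONE_PARENT_SETTID */
static pid_t child_word;	/* zeroed by CLONE_CHILD_CLEARTID at exit */

static int idle_child(void *arg)
{
	(void)arg;
	return 0;
}

/*
 * Returns the TID that clone() reported, or -1 on error. On success,
 * *at_create is the word as seen right after clone() returned and
 * *after_exit is the child's word once the child has been reaped.
 */
pid_t clone_with_tid_words(pid_t *at_create, pid_t *after_exit)
{
	char *stack = malloc(STACK_SIZE);
	pid_t pid;

	if (stack == NULL)
		return -1;

	parent_word = 0;
	child_word = -1;
	pid = clone(idle_child, stack + STACK_SIZE,
		    CLONE_VM | CLONE_PARENT_SETTID | CLONE_CHILD_CLEARTID | SIGCHLD,
		    NULL, &parent_word, NULL, &child_word);
	if (pid == -1) {
		free(stack);
		return -1;
	}
	*at_create = parent_word;	/* already written when clone() returns */

	if (waitpid(pid, NULL, 0) == -1) {
		free(stack);
		return -1;
	}
	*after_exit = child_word;	/* cleared during the child's teardown */
	free(stack);
	return pid;
}
```

The first word should equal the returned TID and the second should be 0
afterwards. A thread library would typically sleep on that second word
rather than poll it, but that part is left out of the sketch.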

(Tangential to this discussion is the TLS or "thread local storage" bit -
some architecture-specific way of indicating a small thread-specific
storage area. It's not the stack, it's just a regular allocation, and
different architectures have different ways of pointing to it. Usually
there's some architected register set aside for it).

And you can mix and match things. You can literally create a new context
that shares the file descriptors (so that one process doing an "open()"
will open files in the other one), but doesn't share the VM space.
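
As an illustration of that particular mix, here is a sketch in which the
child opens a file and the parent checks whether the descriptor became valid
on its side. It assumes glibc's clone() wrapper and that the descriptor
number fits in the 8-bit exit status; opener and child_fd_visible are
hypothetical names:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (256 * 1024)	/* arbitrary size for the child's stack */

/* Child: open a file and report the descriptor number as exit status. */
static int opener(void *arg)
{
	int fd;

	(void)arg;
	fd = open("/dev/null", O_RDONLY);
	return (fd < 0 || fd >= 255) ? 255 : fd;	/* 255 means failure */
}

/*
 * Returns 1 if the descriptor opened by the child is valid in the parent,
 * 0 if it is not, and -1 on error. The VM is never shared here.
 */
int child_fd_visible(int extra_flags)
{
	char *stack = malloc(STACK_SIZE);
	int status, fd;
	pid_t pid;

	if (stack == NULL)
		return -1;

	pid = clone(opener, stack + STACK_SIZE, extra_flags | SIGCHLD, NULL);
	if (pid == -1 || waitpid(pid, &status, 0) == -1) {
		free(stack);
		return -1;
	}
	free(stack);

	if (!WIFEXITED(status) || WEXITSTATUS(status) == 255)
		return -1;
	fd = WEXITSTATUS(status);

	if (fcntl(fd, F_GETFD) == -1)
		return 0;	/* not open in the parent: tables were private */
	close(fd);		/* shared table: the parent owns it now */
	return 1;
}
```

With CLONE_FILES the call should return 1 even though the two contexts have
separate address spaces; with no sharing bits it should return 0.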

Although some of them are interdependent - CLONE_THREAD (which is really
just "all signal state" despite the name - it has nothing to do with VM
per se) depends on CLONE_SIGHAND (which is just the set of signal handlers
associated with the context), which in turn depends on CLONE_VM (because
it doesn't make sense to be able to take a signal in different contexts
unless they share the same VM).
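
The dependency chain can be probed from user space, since kernels that
enforce it reject the inconsistent combinations with EINVAL (that is how the
clone(2) man page documents it; whether a given kernel does so is an
assumption here). A sketch using glibc's clone() wrapper, with nop_child and
clone_errno_for as hypothetical names:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

#define STACK_SIZE (256 * 1024)	/* arbitrary size for the child's stack */

static int nop_child(void *arg)
{
	(void)arg;
	return 0;
}

/*
 * Try clone() with flags | SIGCHLD. Returns 0 if a child was created and
 * reaped, otherwise the errno value from clone().
 * Only meant for flag sets that are either rejected or that produce an
 * ordinary waitable child; do not pass a complete, valid CLONE_THREAD set.
 */
int clone_errno_for(int flags)
{
	char *stack = malloc(STACK_SIZE);
	pid_t pid;
	int err;

	if (stack == NULL)
		return ENOMEM;

	pid = clone(nop_child, stack + STACK_SIZE, flags | SIGCHLD, NULL);
	if (pid == -1) {
		err = errno;
		free(stack);
		return err;
	}
	waitpid(pid, NULL, 0);
	free(stack);
	return 0;
}
```

On such a kernel, clone_errno_for(CLONE_SIGHAND) and
clone_errno_for(CLONE_THREAD | CLONE_VM) should both give EINVAL, while
clone_errno_for(CLONE_VM | CLONE_SIGHAND) should succeed.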

This has gotten fairly far off the notion of stacks and VM. But I hope
it's clear to everybody that I heartily agree with rfork-like
functionality. It's just segmented/private stacks I can't understand.

			Linus