MIME-Version: 1.0
References: <9ab217670904161047w56b70b74ke25a0280b0f70cc2@mail.gmail.com>
Date: Thu, 16 Apr 2009 16:10:38 -0400
Message-ID: <9ab217670904161310xc49286dv247689443b6d18e6@mail.gmail.com>
From: "Devon H. O'Dell"
To: Fans of the OS Plan 9 from Bell Labs <9fans@9fans.net>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Subject: Re: [9fans] security questions

2009/4/16 Venkatesh Srinivas:
> Devlimit / Rlimit is less than ideal - the resource limits aren't
> adaptive to program needs and to resource availability. They would be
> describing resources that user programs have very little visible
> control over (kernel resources), except by changing their syscall mix
> or giving up a segment or so. Or failing outright.

Right, but that's part of the point. User programs have very little
visible control over those resources, but the kernel does need to
impose limitations on them; more on that below.

Either way, I agree: this form of resource limitation sucks. My
reasoning is different, however. The number of places where you need
to add new code, and then verify that every future change respects
these sorts of limitations, makes this much harder to get right than
when the limitation is built into the allocator itself.

> Prohibitions per-user are kinda bad in general - what if you want to
> run a potentially hostile (or more likely buggy) program? You can
> already run it in its own ns, but having it be able to stab you via
> kernel resources, about which you can do nothing, is bad.

This depends on your perspective. If you are an administrator running
a system that provides access to a plethora of users, your viewpoint
is different from that of a programmer writing an application with
expensive needs. A user who wants to run a potentially hostile program
should not be able to affect the system to the point that other users
have their own negative experiences. A user running a buggy program
that hogs a ton of memory is not such a big deal: buy more memory. A
user running a buggy program that runs the kernel out of resources,
causing it to halt, is a big deal.

> The typed allocator is worth looking at for speed reasons - the slab
> allocator and customalloc have shown that it's faster (from the
> perspective of allocation time, fragmentation) to do things that way.
> But I don't really see it addressing the problem? Now the constraints
> are per-resource, but they're still magic constraints.

This is the pitfall of all tunables. Unless someone pulls out a
calculator (or is mathematically brilliant), figuring out exact
numbers for X instances of R resources spread between N users on a
system with Y amount of memory is silly. Fortunately, an administrator
can make educated best guesses based upon these same factors:

1) The expected needs of the users of the system (and in other cases,
   the absolute maximum needs of the users).

2) The limitations of the system itself. My laptop has 2GB of RAM;
   the Plan 9 kernel only sits in 256MB of that.

Now, assuming I have a program that's able to allocate 10,000 64-byte
structures a second, I can panic the system in a matter of minutes.
Does that mean I want to limit my users to under 10,000 of those
structures? Not necessarily, but if I expect 40 users at a given time,
I might want to make sure that number is well under 100,000.
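To put rough numbers on that (just a back-of-envelope calculation
using the figures above):

    40 users * 100,000 structures * 64 bytes = ~256MB

which is roughly all of the 256MB the kernel has to work with on that
laptop, so the per-user cap has to sit well below 100,000 if the
system is to survive everyone hitting it at once.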
While it may not be perfectly ideal, it allows the administrator to
maintain control over the system. Additionally, there's always a way
to turn such limits off (echo 0 > /dev/constraint/resource,
hypothetically speaking). In this respect, I don't see how it doesn't
address the problem...

...Unless you consider that using Pools for granular limitations isn't
the best idea. In that light, perhaps additional pools aren't the
correct answer: while creating an individual pool per limited resource
does implement a hard limit on that resource, each such pool is also
required to have its own maximum amount of memory. So if you ditch
that idea and just make typed allocations, the problem is `solved':
when an allocation takes place, we check various heuristics to
determine whether or not the allocation is valid. Does the user have
over X Fids? Does the process hold open more than Y ports? If these
heuristics fail, the memory is not allocated, and the program takes
whatever direction it takes when it does not have resources (crash,
complain, happily move forward, whatever). If they pass, the program
gets what it needs, and {crashes, complains, happily moves forward}.
(There's a rough sketch of what I mean below, after the quoted text.)

One can indirectly (and more consistently) limit the number of
allocated resources in this fashion (indeed, even the number of open
file descriptors), because the amount of memory consumed by a resource
is proportional to the size of the resource: if I as a user have
64,000 bytes allocated of type Foo, and struct Foo is 64 bytes, then I
hold 1,000 Foos.

The one unfortunate downside is that implementing this as an
allocation limit does not make it `provably safe'. That is to say, if
I create a kernel subsystem that allocates Foos, and I allocate
everything with T_MANAGED (assuming T_MANAGED means that this is a
memory allocation I manage myself), there is no protection on the
number of Foos I allocate. It's therefore not provably safer in terms
of crashing the kernel, and I haven't been able to come up with an
idea that is provably safe (unless type determination is done using
getcallerpc, which would result in an insanely large number of
tunables and would be completely impractical). Extending the API in
this fashion, however, almost ensures that this case will not occur:
since the programmer must specify a type for allocation, they must be
aware of the API and the reasoning (or at least we'd all like to hope
so). If a malicious person is able to load unsafe code into the
kernel, you're screwed anyway. So really, this project is more to
protect the diligent and less to help the lazy. (The lazy are all
going to have for (i in `{ls /dev/constraint}) { echo 0 > $i } in
their {term,cpu}rc anyway.)

> Something that might be interesting would be for each primitive pgroup
> to be born with a maximum percentage 'under pressure' associated with
> each interesting resource. Child pgroups would allocate out of their
> parent's pgroup limits. Until there is pressure, limits could be
> unenforced, leading to higher utilization than magic constants in
> rlimit.
>
> To give a chance for a process to give up some of its resources
> (caches, recomputable data) under pressure, there could be a
> per-process /dev/halp device, which a program could read; reads would
> block until a process was over its limit and something else needed
> some more soup. If the app responds, great. Otherwise, something like
> culling it or swapping it out (assuming that swap still worked :)) or
> even slowing down its allocation rate artificially might be answers...
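Before replying to that: here's the rough sketch of the allocation-time
check I promised above. It's a minimal user-space illustration only;
none of these names (Constraint, typedalloc, Tfid, Tport) exist
anywhere, and per-user lookup, locking, and sensible limit values are
all omitted.

/*
 * Hypothetical sketch of a typed allocation check: the budget lives
 * in the allocator, so a subsystem gets the limit just by picking a
 * type. None of this is real kernel code.
 */
#include <stdlib.h>

enum { Tfid, Tport, Tmax };          /* hypothetical resource types */

typedef struct Constraint Constraint;
struct Constraint {
    unsigned long max;               /* 0 means unconstrained */
    unsigned long inuse;             /* current count for this user */
};

/* one table like this per user; lookup and locking omitted */
static Constraint constraint[Tmax] = {
    [Tfid]  = { 100000, 0 },         /* placeholder numbers */
    [Tport] = { 1000, 0 },
};

void*
typedalloc(unsigned long size, int type)
{
    Constraint *c = &constraint[type];

    if(c->max != 0 && c->inuse >= c->max)
        return NULL;                 /* over budget: fail, caller copes */
    c->inuse++;
    return calloc(1, size);
}

void
typedfree(void *p, int type)
{
    if(p == NULL)
        return;
    constraint[type].inuse--;        /* give the budget back */
    free(p);
}

The point is only that the check and the count live in the allocator,
and freeing returns the budget, so nobody sprinkles limit checks
through their own subsystem.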
Programs consuming user memory aren't an issue, in case that wasn't
clear. The problem is programs that consume kernel memory indirectly
through their own behavior. I think you get this based on some of the
points you raised earlier, but I just wanted to make sure. For
instance, reading from a resource is extremely unlikely to cause
issues inside the kernel -- to read the data, all the structures for
passing it must already be allocated. If the data came over a socket
(as erik pointed out), that memory is pre-allocated or of a fixed
size.

> If people are interested, there is some work going on in a research
> kernel called Viengoos
> (http://www.gnu.org/software/hurd/microkernel/viengoos.html) (gasp!
> the hurd!) trying to address pretty much the same problem...

I read the paper, but it's hard to see how exactly this would apply to
the problem at hand. There's a strong bias there towards providing
programs with better scheduling / more memory / more allowable
resources if the program is well behaved. This is interesting for
improving the user experience of programs, but given the problem I'm
trying to solve, I see two drawbacks:

1) Programmers are required to do more work to guarantee that their
   programs won't be adversely affected by the system.

2) You still have to have hard limits (in this case, arbitrarily based
   on percentages) to avoid a user program running the kernel out of
   resources.

At the end of the day, you must have a limit that is lower than the
maximum amount of kernel memory, minus any overhead from managed
resources. Any solution will overcommit, but a percentage-based
solution seems more difficult to tune. It's also much more complex
(and thus much more prone to error in multiple places). Since the
heuristics for determining the resource limits would be automated,
it's not necessarily provable that someone couldn't find a way to
subvert the limitations. It adds complexity to the scheduler, to the
slab allocator, and to any area of code that would need to check
resources (someone adding a new resource then needs to do much more
work than registering a new memory type and using it for allocation).
Quite frankly, the added complexity scares me a bit.

Perhaps I am missing something, so if you can address those points,
that would be good.

> -- vs

--dho