Re: [musl] Re: [RFC] Possible new execveat(2) Linux syscall

From: Rich Felker <dalias@aerifal.cx>
To: Andy Lutomirski <luto@amacapital.net>
Cc: libc-alpha <libc-alpha@sourceware.org>,
	musl@lists.openwall.com,
	Andrew Morton <akpm@linux-foundation.org>,
	David Drysdale <drysdale@google.com>,
	Linux API <linux-api@vger.kernel.org>,
	Christoph Hellwig <hch@infradead.org>
Subject: Re: [musl] Re: [RFC] Possible new execveat(2) Linux syscall
Date: Sun, 16 Nov 2014 18:32:02 -0500
Message-ID: <20141116233202.GA22465@brightrain.aerifal.cx>
In-Reply-To: <CALCETrVtN73rTxGXV9Xt+sPOitAWCcyrfUWY_3_tAmd+n6V1gA@mail.gmail.com>

On Sun, Nov 16, 2014 at 02:34:32PM -0800, Andy Lutomirski wrote:
> On Sun, Nov 16, 2014 at 2:08 PM, Rich Felker <dalias@aerifal.cx> wrote:
> > On Sun, Nov 16, 2014 at 01:20:39PM -0800, Andy Lutomirski wrote:
> >> On Nov 16, 2014 11:53 AM, "Rich Felker" <dalias@aerifal.cx> wrote:
> >> >
> >> > On Fri, Nov 14, 2014 at 02:54:19PM +0000, David Drysdale wrote:
> >> > > Hi,
> >> > >
> >> > > Over at the LKML[1] we've been discussing a possible new syscall, execveat(2),
> >> > > and it would be good to hear a glibc perspective about it (and whether there
> >> > > are any interface changes that would make it easier to use from userspace).
> >> > >
> >> > > The syscall prototype is:
> >> > >   int execveat(int fd, const char *pathname,
> >> > >                       char *const argv[],  char *const envp[],
> >> > >                       int flags); /* AT_EMPTY_PATH, AT_SYMLINK_NOFOLLOW */
> >> > > and it works similarly to execve(2) except:
> >> > >  - the executable to run is identified by the combination of fd+pathname, like
> >> > >    other *at(2) syscalls
> >> > >  - there's an extra flags field to control behaviour.
> >> > > (I've attached a text version of the suggested man page below)
> >> > >
> >> > > One particular benefit of this is that it allows an fexecve(3) implementation
> >> > > that doesn't rely on /proc being accessible, which is useful for sandboxed
> >> > > applications.  (However, that does only work for non-interpreted programs:
> >> > > the name passed to a script interpreter is of the form "/dev/fd/<fd>/<path>"
> >> > > or "/dev/fd/<fd>", so the executed interpreter will normally still need /proc
> >> > > access to load the script file).
> >> > >
> >> > > How does this sound from a glibc perspective?
> >> >
> >> > I've been following the discussions so far and everything looks mostly
> >> > okay. There are still issues to be resolved with the different
> >> > semantics between Linux O_PATH and what POSIX requires for O_EXEC (and
> >> > O_SEARCH) but as long as the intent is that, once O_EXEC is defined to
> >> > save the permissions at the time of open and cause them to be used in
> >> > place of the current file permissions at the time of execveat
> >>
> >> Is something missing here?
> >>
> >> FWIW, I don't understand O_PATH or O_EXEC very well, so from my POV,
> >> help would be appreciated.
> >
> > Yes. POSIX requires that permission checks for execution (fexecve with
> > O_EXEC file descriptors) and directory-search (*at functions with
> > O_SEARCH file descriptors) succeed if the open operation succeeded --
> > the permissions check is required to take place at open time rather
> > than at exec/search time. There's a separate discussion about how to
> > make this work on the kernel side.
> 
> It may be worth making this work as part of adding execveat to the
> kernel.  Does the kernel even have O_EXEC right now?

No. The proposal is that O_EXEC and O_SEARCH would both be equal to
O_PATH|3 (3 being the rarely-used O_ACCMODE value for "neither read nor
write, but some weird ioctls are accepted"), which falls back gracefully
both on current kernels with O_PATH (in which case the 3 is ignored and
the only discrepancy from POSIX is the time at which permissions are
checked) and on pre-O_PATH kernels (in which case the access mode used
is 3, and read/write ops fail on the fd, but it's still usable for
fexecve and *at functions with /proc-based fallback implementations).

I would be happy to see this work get done at the same time.

> >> > One major issue however is FD_CLOEXEC with scripts. Last I checked,
> >> > this didn't work because the file is already closed by the time the
> >> > interpreter runs. The intended usage of fexecve is almost certainly to
> >> > call it with the file descriptor set close-on-exec; otherwise, there
> >> > would be no clean way to close it, since the program being executed
> >> > doesn't know that it's being executed via fexecve. So this is a
> >> > serious problem that needs to be solved if it hasn't already. I have
> >> > some ideas I could offer, but I'm not an expert on the kernel side
> >> > things so I'm not sure they'd be correct.
> >>
> >> Bring on the ideas.
> >
> > My thought is that when the kernel opens the binary and sees that it's
> > a script that needs an interpreter, the kernel should not pass
> > /proc/self/fd/%d to the interpreter, but instead should pass the name
> > of a new magic symlink in /proc/self that's connected to the inode for
> > the script to be executed but that ceases to exist as soon as it's
> > opened. In theory this could also be used for suid scripts to make
> > them secure.
> 
> This doesn't help if /proc is not mounted, which is an important use case.

I don't know what can be done in this case short of some really ugly
hacks, like giving open() special behavior when the pathname points to
a magic address in the argv region, or having the kernel create temp
files in some magic path.

> >> FWIW, I've often thought that interpreter binaries should mark
> >> themselves as such to enable better interactions with the kernel.
> >
> > That's hard since users expect to be able to use arbitrary
> > interpreters (and sometimes even pass through multiple ones, e.g.
> > #!/usr/bin/env perl).
> 
> Hmm.  I'd be okay with old interpreters having a somewhat degraded experience.
> 
> I guess that #!/some/interpreted/script isn't allowed, but maybe
> #!/usr/bin/env some-interpreted-script should work.
> 
> It could be that all that's really needed is some convention to tell
> an interpreter that it should use fd N as a script *and close it*.
> Something like /dev/fd_and_close/N could work, but that has all kinds
> of problems.
> 
> Alternatively, if we could have a way to mark an fd so that it's
> close-on-exec after exec, that would solve the nesting problem, as
> long as every interpreter in the chain does it.  And the kernel could
> certainly implement execve on a close-on-exec fd by passing /dev/fd/N
> where N is a close-on-exec fd, at least in the non-nested case.

This doesn't solve the problem of needing /proc though (/dev/fd is
just a link to /proc/self/fd).
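
To make the dependency concrete, here is a rough sketch of the kind of
/proc-based fexecve fallback mentioned earlier. The helper names
proc_fd_path and fexecve_via_proc are hypothetical and this is not the
code of any particular libc. The exec can only succeed if /proc is
mounted, which is exactly the limitation at issue:

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/* Format the /proc name of fd into buf.
 * Returns 0 on success, -1 if fd is negative or buf is too small. */
int proc_fd_path(char *buf, size_t size, int fd)
{
    assert(buf != NULL);
    if (fd < 0)
        return -1;
    int n = snprintf(buf, size, "/proc/self/fd/%d", fd);
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/* Hypothetical fexecve-style helper: execute the file that fd refers to
 * by going through its /proc name. Returns only on failure, with errno
 * set; without /proc mounted the execve call cannot find the path. */
int fexecve_via_proc(int fd, char *const argv[], char *const envp[])
{
    char path[64];
    if (proc_fd_path(path, sizeof path, fd) < 0) {
        errno = EBADF;
        return -1;
    }
    return execve(path, argv, envp);
}
```

A script interpreter handed a /dev/fd/N name ends up resolving the same
/proc path, so it hits the same requirement.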

Rich
