From mboxrd@z Thu Jan  1 00:00:00 1970
X-Msuck: nntp://news.gmane.org/gmane.linux.lib.musl.general/6526
Path: news.gmane.org!not-for-mail
From: Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org>
Newsgroups: gmane.linux.kernel.api,gmane.comp.lib.glibc.alpha,gmane.linux.lib.musl.general
Subject: Re: [musl] Re: [RFC] Possible new execveat(2) Linux syscall
Date: Sun, 16 Nov 2014 16:06:10 -0800
Message-ID: <CALCETrVpm9-Osa+MRs8oxmmDNAJDCC1p4Z8dXiEnAb_WBsNP-Q@mail.gmail.com>
References: <CAHse=S8ccC2No5EYS0Pex=Ng3oXjfDB9woOBmMY_k+EgxtODZA@mail.gmail.com>
 <20141116195246.GX22465@brightrain.aerifal.cx> <CALCETrWWUyizL8HxZKaYE+xuV5eGi8mQcequT9HPvvac=X-dLg@mail.gmail.com>
 <20141116220859.GY22465@brightrain.aerifal.cx> <CALCETrVtN73rTxGXV9Xt+sPOitAWCcyrfUWY_3_tAmd+n6V1gA@mail.gmail.com>
 <20141116233202.GA22465@brightrain.aerifal.cx>
NNTP-Posting-Host: plane.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
X-Trace: ger.gmane.org 1416182803 12468 80.91.229.3 (17 Nov 2014 00:06:43 GMT)
X-Complaints-To: usenet@ger.gmane.org
NNTP-Posting-Date: Mon, 17 Nov 2014 00:06:43 +0000 (UTC)
Cc: libc-alpha <libc-alpha-9JcytcrH/bA+uJoB2kUjGw@public.gmane.org>, musl-ZwoEplunGu1jrUoiu81ncdBPR1lH4CV8@public.gmane.org,
	Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>,
	David Drysdale <drysdale-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>,
	Linux API <linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
	Christoph Hellwig <hch-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
To: Rich Felker <dalias-/miJ2pyFWUyWIDz0JBNUog@public.gmane.org>
Original-X-From: linux-api-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org Mon Nov 17 01:06:36 2014
Return-path: <linux-api-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>
Envelope-to: glka-linux-api-wOFGN7rlS/M9smdsby/KFg@public.gmane.org
Original-Received: from vger.kernel.org ([209.132.180.67])
	by plane.gmane.org with esmtp (Exim 4.69)
	(envelope-from <linux-api-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>)
	id 1Xq9pz-0005WD-J8
	for glka-linux-api-wOFGN7rlS/M9smdsby/KFg@public.gmane.org; Mon, 17 Nov 2014 01:06:36 +0100
Original-Received: (majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org) by vger.kernel.org via listexpand
	id S1751544AbaKQAGd (ORCPT <rfc822;glka-linux-api@m.gmane.org>);
	Sun, 16 Nov 2014 19:06:33 -0500
Original-Received: from mail-la0-f44.google.com ([209.85.215.44]:53979 "EHLO
	mail-la0-f44.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1751527AbaKQAGd (ORCPT
	<rfc822;linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>); Sun, 16 Nov 2014 19:06:33 -0500
Original-Received: by mail-la0-f44.google.com with SMTP id hz20so7393550lab.3
        for <linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>; Sun, 16 Nov 2014 16:06:31 -0800 (PST)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20130820;
        h=x-gm-message-state:mime-version:in-reply-to:references:from:date
         :message-id:subject:to:cc:content-type;
        bh=vKbRVv9FaTP0D7u9jBffWuprBebLLI5HFUpyArPDcIY=;
        b=gwMROJwBhQ9QO76D6cpOEPenHhAz3xQpaNczAtzfgTWQPU3aWk885vecqA080rnoWJ
         rRKj8kFOwvv3bv/qX4ZVZ3GCpPGXX9iVsFV9IoIEZLsPO6JaDOYnbTPImvUAp9dmZnSx
         vQL+OenzQWAdGSowb0zKJMdMbfa5qqUXPR52q/wgsz5bml3YuBvxQ2uNoMPhMbVGvA1n
         wCLi5F9sg2WAQj8sfJSQkwB8W23OprboM+Yl3h2PvxvK6qNqZnywwje4M86eAf8giN0X
         U9Z2RFnW7PGVXudaNAOPREIrlnJ2tCGKLSEo9YcYn0Cw98hR9BK31bO5DtL4FdqjgoP9
         CUCQ==
X-Gm-Message-State: ALoCoQl5F8/kfrvewysnD9jxW/EXWuhLbSoJ+jTKdUd+esQp4yYrKb8mWCukFdypWS30C8DSBKPr
X-Received: by 10.112.168.97 with SMTP id zv1mr23830931lbb.6.1416182791214;
 Sun, 16 Nov 2014 16:06:31 -0800 (PST)
Original-Received: by 10.152.7.170 with HTTP; Sun, 16 Nov 2014 16:06:10 -0800 (PST)
In-Reply-To: <20141116233202.GA22465-C3MtFaGISjmo6RMmaWD+6Sb1p8zYI1N1@public.gmane.org>
Original-Sender: linux-api-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Precedence: bulk
List-ID: <linux-api.vger.kernel.org>
X-Mailing-List: linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Xref: news.gmane.org gmane.linux.kernel.api:6169 gmane.comp.lib.glibc.alpha:46712 gmane.linux.lib.musl.general:6526
Archived-At: <http://permalink.gmane.org/gmane.linux.kernel.api/6169>

On Sun, Nov 16, 2014 at 3:32 PM, Rich Felker <dalias-/miJ2pyFWUyWIDz0JBNUog@public.gmane.org> wrote:
> On Sun, Nov 16, 2014 at 02:34:32PM -0800, Andy Lutomirski wrote:
>> On Sun, Nov 16, 2014 at 2:08 PM, Rich Felker <dalias-/miJ2pyFWUyWIDz0JBNUog@public.gmane.org> wrote:
>> > On Sun, Nov 16, 2014 at 01:20:39PM -0800, Andy Lutomirski wrote:
>> >> On Nov 16, 2014 11:53 AM, "Rich Felker" <dalias-/miJ2pyFWUyWIDz0JBNUog@public.gmane.org> wrote:
>> >> >
>> >> > On Fri, Nov 14, 2014 at 02:54:19PM +0000, David Drysdale wrote:
>> >> > > Hi,
>> >> > >
>> >> > > Over at the LKML[1] we've been discussing a possible new syscall, execveat(2),
>> >> > > and it would be good to hear a glibc perspective about it (and whether there
>> >> > > are any interface changes that would make it easier to use from userspace).
>> >> > >
>> >> > > The syscall prototype is:
>> >> > >   int execveat(int fd, const char *pathname,
>> >> > >                       char *const argv[],  char *const envp[],
>> >> > >                       int flags); /* AT_EMPTY_PATH, AT_SYMLINK_NOFOLLOW */
>> >> > > and it works similarly to execve(2) except:
>> >> > >  - the executable to run is identified by the combination of fd+pathname, like
>> >> > >    other *at(2) syscalls
>> >> > >  - there's an extra flags field to control behaviour.
>> >> > > (I've attached a text version of the suggested man page below)
>> >> > >
>> >> > > One particular benefit of this is that it allows an fexecve(3) implementation
>> >> > > that doesn't rely on /proc being accessible, which is useful for sandboxed
>> >> > > applications.  (However, that does only work for non-interpreted programs:
>> >> > > the name passed to a script interpreter is of the form "/dev/fd/<fd>/<path>"
>> >> > > or "/dev/fd/<fd>", so the executed interpreter will normally still need /proc
>> >> > > access to load the script file).
>> >> > >
>> >> > > How does this sound from a glibc perspective?
>> >> >
>> >> > I've been following the discussions so far and everything looks mostly
>> >> > okay. There are still issues to be resolved with the different
>> >> > semantics between Linux O_PATH and what POSIX requires for O_EXEC (and
>> >> > O_SEARCH) but as long as the intent is that, once O_EXEC is defined to
>> >> > save the permissions at the time of open and cause them to be used in
>> >> > place of the current file permissions at the time of execveat
>> >>
>> >> Is something missing here?
>> >>
>> >> FWIW, I don't understand O_PATH or O_EXEC very well, so from my POV,
>> >> help would be appreciated.
>> >
>> > Yes. POSIX requires that permission checks for execution (fexecve with
>> > O_EXEC file descriptors) and directory-search (*at functions with
>> > O_SEARCH file descriptors) succeed if the open operation succeeded --
>> > the permissions check is required to take place at open time rather
>> > than at exec/search time. There's a separate discussion about how to
>> > make this work on the kernel side.
>>
>> It may be worth making this work as part of adding execveat to the
>> kernel.  Does the kernel even have O_EXEC right now?
>
> No. The proposal is that O_EXEC and O_SEARCH would both be equal to
> O_PATH|3 (3 being the rarely-used O_ACCMODE for "neither read nor
> write, but some weird ioctls are accepted") which gracefully falls
> back for both current kernels with O_PATH (in which case the 3 is
> ignored and the discrepancy from POSIX is just the time at which
> permissions are checked) and for pre-O_PATH kernels (in which case the
> access mode used is 3, and read/write ops fail on the fd, but it's
> still usable for fexecve and *at functions with /proc-based fallback
> implementations).
>
> I would be happy to see this work get done at the same time.
>
>> >> > One major issue however is FD_CLOEXEC with scripts. Last I checked,
>> >> > this didn't work because the file is already closed by the time the
>> >> > interpreter runs. The intended usage of fexecve is almost certainly to
>> >> > call it with the file descriptor set close-on-exec; otherwise, there
>> >> > would be no clean way to close it, since the program being executed
>> >> > doesn't know that it's being executed via fexecve. So this is a
>> >> > serious problem that needs to be solved if it hasn't already. I have
>> >> > some ideas I could offer, but I'm not an expert on the kernel side
>> >> > things so I'm not sure they'd be correct.
>> >>
>> >> Bring on the ideas.
>> >
>> > My thought is that when the kernel opens the binary and sees that it's
>> > a script that needs an interpreter, the kernel should not pass
>> > /proc/self/fd/%d to the interpreter, but instead should pass the name
>> > of a new magic symlink in /proc/self that's connected to the inode for
>> > the script to be executed but that ceases to exist as soon as it's
>> > opened. In theory this could also be used for suid scripts to make
>> > them secure.
>>
>> This doesn't help if /proc is not mounted, which is an important use case.
>
> I don't know what can be done in this case short of some really ugly
> hacks, like giving open() special behavior when the pathname points to
> a magic address in the argv region, or having the kernel create temp
> files in some magic path.
>
>> >> FWIW, I've often thought that interpreter binaries should mark
>> >> themselves as such to enable better interactions with the kernel.
>> >
>> > That's hard since users expect to be able to use arbitrary
>> > interpreters (and sometimes even pass through multiple ones, e.g.
>> > #!/usr/bin/env perl).
>>
>> Hmm.  I'd be okay with old interpreters having a somewhat degraded experience.
>>
>> I guess that #!/some/interpreted/script isn't allowed, but maybe
>> #!/usr/bin/env some-interpreted-script should work.
>>
>> It could be that all that's really needed is some convention to tell
>> an interpreter that it should use fd N as a script *and close it*.
>> Something like /dev/fd_and_close/N could work, but that has all kinds
>> of problems.
>>
>> Alternatively, if we could have a way to mark an fd so that it's
>> close-on-exec after exec, that would solve the nesting problem, as
>> long as every interpreter in the chain does it.  And the kernel could
>> certainly implement execve on a close-on-exec fd by passing /dev/fd/N
>> where N is a close-on-exec fd, at least in the non-nested case.
>
> This doesn't solve the problem of needing /proc though (/dev/fd is
> just a link to /proc/self/fd).
>

Al Viro was talking about having a special fs just for /dev/fd.  And
interpreters could special-case path names of a certain form.
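
To make that concrete, here is a rough sketch of what the
special-casing might look like on the interpreter side. It assumes the
"certain form" is exactly /dev/fd/N or /proc/self/fd/N; nothing like
that has been agreed on, and both function names are made up for
illustration. The interpreter would read the script from fd N directly
instead of reopening the path, and mark the fd close-on-exec so that
it does not leak into anything the script runs.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: if path is exactly "/dev/fd/<N>" or
 * "/proc/self/fd/<N>" with N a plain decimal number, return N.
 * Return -1 for anything else, including "/dev/fd/<N>/<path>".
 */
int script_fd_from_path(const char *path)
{
    static const char *const prefixes[] = { "/dev/fd/", "/proc/self/fd/" };

    assert(path != NULL);

    for (size_t i = 0; i < sizeof prefixes / sizeof prefixes[0]; i++) {
        size_t n = strlen(prefixes[i]);
        if (strncmp(path, prefixes[i], n) != 0)
            continue;

        const char *digits = path + n;
        /* Reject an empty suffix, signs and leading whitespace. */
        if (*digits < '0' || *digits > '9')
            return -1;

        char *end;
        errno = 0;
        long v = strtol(digits, &end, 10);
        /* Reject overflow and any trailing characters. */
        if (errno != 0 || *end != '\0' || v > INT_MAX)
            return -1;
        return (int)v;
    }
    return -1;
}

/*
 * Hypothetical helper: set FD_CLOEXEC on fd while preserving the
 * other descriptor flags. Returns 0 on success, -1 on error.
 */
int mark_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFD, flags | FD_CLOEXEC) < 0 ? -1 : 0;
}
```

An interpreter would call script_fd_from_path(argv[1]) at startup and
fall back to a normal open() when it returns -1. This only covers the
non-nested case; it does nothing for a chain of interpreters unless
each one follows the same convention.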

--Andy