From mboxrd@z Thu Jan 1 00:00:00 1970
From: Rich Felker
Newsgroups: gmane.comp.lib.glibc.alpha,gmane.linux.lib.musl.general,gmane.linux.kernel.api
Subject: Re: [musl] Re: [RFC] Possible new execveat(2) Linux syscall
Date: Sun, 16 Nov 2014 18:32:02 -0500
Message-ID: <20141116233202.GA22465@brightrain.aerifal.cx>
References: <20141116195246.GX22465@brightrain.aerifal.cx> <20141116220859.GY22465@brightrain.aerifal.cx>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
To: Andy Lutomirski
Cc: libc-alpha, musl@lists.openwall.com, Andrew Morton, David Drysdale, Linux API, Christoph Hellwig
User-Agent: Mutt/1.5.21 (2010-09-15)

On Sun, Nov 16, 2014 at 02:34:32PM -0800, Andy Lutomirski wrote:
> On Sun, Nov 16, 2014 at 2:08 PM, Rich Felker wrote:
> > On Sun, Nov 16, 2014 at 01:20:39PM -0800, Andy Lutomirski wrote:
> >> On Nov 16, 2014 11:53 AM, "Rich Felker" wrote:
> >> >
> >> > On Fri, Nov 14, 2014 at 02:54:19PM +0000, David Drysdale wrote:
> >> > > Hi,
> >> > >
> >> > > Over at the LKML[1] we've been discussing a possible new syscall, execveat(2),
> >> > > and it would be good to hear a glibc perspective about it (and whether there
> >> > > are any interface changes that would make it easier to use from userspace).
> >> > >
> >> > > The syscall prototype is:
> >> > >   int execveat(int fd, const char *pathname,
> >> > >                char *const argv[], char *const envp[],
> >> > >                int flags); /* AT_EMPTY_PATH, AT_SYMLINK_NOFOLLOW */
> >> > > and it works similarly to execve(2) except:
> >> > >  - the executable to run is identified by the combination of fd+pathname, like
> >> > >    other *at(2) syscalls
> >> > >  - there's an extra flags field to control behaviour.
> >> > > (I've attached a text version of the suggested man page below)
> >> > >
> >> > > One particular benefit of this is that it allows an fexecve(3) implementation
> >> > > that doesn't rely on /proc being accessible, which is useful for sandboxed
> >> > > applications. (However, that does only work for non-interpreted programs:
> >> > > the name passed to a script interpreter is of the form "/dev/fd//"
> >> > > or "/dev/fd/", so the executed interpreter will normally still need /proc
> >> > > access to load the script file).
> >> > >
> >> > > How does this sound from a glibc perspective?
> >> >
> >> > I've been following the discussions so far and everything looks mostly
> >> > okay. There are still issues to be resolved with the different
> >> > semantics between Linux O_PATH and what POSIX requires for O_EXEC (and
> >> > O_SEARCH) but as long as the intent is that, once O_EXEC is defined to
> >> > save the permissions at the time of open and cause them to be used in
> >> > place of the current file permissions at the time of execveat
> >>
> >> Is something missing here?
> >>
> >> FWIW, I don't understand O_PATH or O_EXEC very well, so from my POV,
> >> help would be appreciated.
> >
> > Yes. POSIX requires that permission checks for execution (fexecve with
> > O_EXEC file descriptors) and directory-search (*at functions with
> > O_SEARCH file descriptors) succeed if the open operation succeeded --
> > the permissions check is required to take place at open time rather
> > than at exec/search time.
> > There's a separate discussion about how to
> > make this work on the kernel side.
>
> It may be worth making this work as part of adding execveat to the
> kernel. Does the kernel even have O_EXEC right now?

No. The proposal is that O_EXEC and O_SEARCH would both be equal to
O_PATH|3 (3 being the rarely-used O_ACCMODE value for "neither read nor
write, but some weird ioctls are accepted"), which falls back
gracefully both for current kernels with O_PATH (in which case the 3 is
ignored and the discrepancy from POSIX is just the time at which
permissions are checked) and for pre-O_PATH kernels (in which case the
access mode used is 3, and read/write ops fail on the fd, but it's
still usable for fexecve and *at functions with /proc-based fallback
implementations). I would be happy to see this work get done at the
same time.

> >> > One major issue however is FD_CLOEXEC with scripts. Last I checked,
> >> > this didn't work because the file is already closed by the time the
> >> > interpreted runs. The intended usage of fexecve is almost certainly to
> >> > call it with the file descriptor set close-on-exec; otherwise, there
> >> > would be no clean way to close it, since the program being executed
> >> > doesn't know that it's being executed via fexecve. So this is a
> >> > serious problem that needs to be solved if it hasn't already. I have
> >> > some ideas I could offer, but I'm not an expert on the kernel side
> >> > things so I'm not sure they'd be correct.
> >>
> >> Bring on the ideas.
> >
> > My thought is that when the kernel opens the binary and sees that it's
> > a script that needs an interpreter, the kernel should not pass
> > /proc/self/fd/%d to the interpreter, but instead should pass the name
> > of a new magic symlink in /proc/self that's connected to the inode for
> > the script to be executed but that ceases to exist as soon as it's
> > opened. In theory this could also be used for suid scripts to make
> > them secure.
>
> This doesn't help if /proc is not mounted, which is an important use case.

I don't know what can be done in this case short of some really ugly
hacks, like giving open() special behavior when the pathname points to
a magic address in the argv region, or having the kernel create temp
files in some magic path.

> >> FWIW, I've often thought that interpreter binaries should mark
> >> themselves as such to enable better interactions with the kernel.
> >
> > That's hard since users expect to be able to use arbitrary
> > interpreters (and sometimes even pass through multiple ones, e.g.
> > #!/usr/bin/env perl).
>
> Hmm. I'd be okay with old interpreters having a somewhat degraded experience.
>
> I guess that #!/some/interpreted/script isn't allowed, but maybe
> #!/usr/bin/env some-interpreted-script should work.
>
> It could be that all that's really needed is some convention to tell
> an interpreter that it should use fd N as a script *and close it*.
> Something like /dev/fd_and_close/N could work, but that has all kinds
> of problems.
>
> Alternatively, if we could have a way to mark an fd so that it's
> close-on-exec after exec, that would solve the nesting problem, as
> long as every interpreter in the chain does it. And the kernel could
> certainly implement execve on a close-on-exec fd by passing /dev/fd/N
> where N is a close-on-exec fd, at least in the non-nested case.

This doesn't solve the problem of needing /proc though (/dev/fd is
just a link to /proc/self/fd).

Rich
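
For concreteness, here is a minimal sketch in C of the fexecve(3)
approach discussed in this thread: try execveat() with an empty
pathname and AT_EMPTY_PATH, and fall back to the /proc-based path only
when the kernel reports ENOSYS. It assumes the syscall is exposed as
SYS_execveat with the prototype quoted above. sketch_fexecve and
sketch_run are hypothetical names for illustration, not libc functions.
Opening the file close-on-exec as done here is only expected to work
for non-interpreted binaries, per the FD_CLOEXEC problem with scripts
described above.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef AT_EMPTY_PATH
#define AT_EMPTY_PATH 0x1000 /* Linux value; normally from <fcntl.h> */
#endif

/* Hypothetical helper: execute the file that fd refers to.
 * Returns -1 on failure, like execve(); does not return on success. */
int sketch_fexecve(int fd, char *const argv[], char *const envp[])
{
    assert(fd >= 0);
#ifdef SYS_execveat
    /* Empty pathname plus AT_EMPTY_PATH means "the file fd refers to". */
    syscall(SYS_execveat, fd, "", argv, envp, AT_EMPTY_PATH);
    if (errno != ENOSYS)
        return -1;
#endif
    /* Fallback for kernels without the syscall; this path needs /proc. */
    char path[sizeof "/proc/self/fd/" + 3 * sizeof(int)];
    snprintf(path, sizeof path, "/proc/self/fd/%d", fd);
    execve(path, argv, envp);
    return -1;
}

/* Hypothetical driver: run `file` via sketch_fexecve in a child process
 * and return its exit status, or -1 if it could not be determined. */
int sketch_run(const char *file)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        char *const argv[] = { (char *)file, NULL };
        char *const envp[] = { NULL };
        /* Close-on-exec, so the new program does not inherit the fd. */
        int fd = open(file, O_RDONLY | O_CLOEXEC);
        if (fd >= 0)
            sketch_fexecve(fd, argv, envp);
        _exit(127); /* open or exec failed */
    }
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling sketch_run("/bin/true") should yield the child's exit status
when /bin/true is an ordinary binary. The fallback branch is the part
that still depends on /proc being mounted.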