From: Rich Felker
To: musl@lists.openwall.com
Subject: Re: validation of utf-8 strings passed as system call arguments
Date: Thu, 12 Dec 2013 23:39:41 -0500
Message-ID: <20131213043941.GA24286@brightrain.aerifal.cx>

On Thu, Dec 12, 2013 at 09:30:06PM -0700, writeonce@midipix.org wrote:
> Hello,
>
> While working on code that converts arguments from utf-16 to utf-8, I
> found myself wondering about the "responsibility" for checking
> well-formedness of utf-8 strings that are passed to the kernel.
> As I suspected, validation of these strings takes place neither in the
> kernel, nor in the C library. The attached program demonstrates this by
> creating a file named <0xE0 0x9F 0x80>, which according to the Unicode
> Standard (6.2, p. 95) is an ill-formed byte sequence.
>
> I am not sure whether this can officially be considered a bug, and it is
> quite clear that fixing this is going to entail some performance penalty.
> That being said, after deleting this file from my Ubuntu desktop most (but
> not all) attempts to open the Trash folder made Nautilus crash, and it was
> only after deleting the file permanently from the shell that order had
> been restored...

There's nothing in POSIX that says that filenames have to be valid
strings in the current locale's encoding -- in fact, this is highly
problematic to enforce on implementations other than musl, such as
glibc, where the encoding might vary by locale and where different
users might be using locales with different encodings. But there's
also nothing that says arbitrary byte sequences (excluding of course
those containing '/' and NUL) have to be accepted as filenames either.

The historical _expectation_ and practice has been that filenames can
contain arbitrary byte sequences. And Linus in particular is opposed
to changing this, though there's been some indication (I don't have
references right off) that he might be open to optional restrictions
at the kernel level.

What's clear to me is that restrictions at the libc level are not
useful. If your concern is that creating files with illegal sequences
in their names can confuse/break/crash some software, adding a
restriction on file creation in libc won't help. A malicious user can
just make the syscalls directly to make malicious filenames.
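
To illustrate the point about direct syscalls, here is a minimal
Linux-only sketch. The helper names create_raw and remove_raw are made
up for this example and are not part of musl or any other libc. They
go through the generic syscall() entry point rather than open() or
unlink(), so a filename check placed in those wrappers would never see
the path.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: create (or open) `path` with the raw openat
 * syscall instead of the open()/creat() wrappers.
 * Returns 0 on success, -1 on failure. */
int create_raw(const char *path)
{
    long fd = syscall(SYS_openat, (long)AT_FDCWD, path,
                      (long)(O_WRONLY | O_CREAT), (long)0600);
    if (fd < 0)
        return -1;
    syscall(SYS_close, fd);
    return 0;
}

/* Hypothetical helper: remove `path` with the raw unlinkat syscall.
 * Returns 0 on success, -1 on failure. */
int remove_raw(const char *path)
{
    return syscall(SYS_unlinkat, (long)AT_FDCWD, path, (long)0) < 0 ? -1 : 0;
}
```

Calling create_raw("\xE0\x9F\x80") would attempt to create the
ill-formed name from the original report in the current directory.
syscall() is itself a libc function, but the same request can be
issued from a few lines of assembly, so filtering there would not
change the conclusion.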
On the other hand, having the restriction in libc would be annoying
because it would _prevent_ you from renaming or deleting these bad
filenames using standard tools; you'd have to use special tools that
make the syscalls directly.

So if you want protection against illegal sequences in filenames
(personally, I want this too) the right place to lobby for it (and
propose an optional feature) is in the kernel, not in libc.

Rich
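
For reference, a sketch of the well-formedness test that any such
optional restriction would have to apply, wherever it ends up living.
The function name utf8_wellformed is made up for this example; the
byte ranges follow Table 3-7 ("Well-Formed UTF-8 Byte Sequences") of
the Unicode Standard. It is not taken from the kernel or from musl.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if the n bytes at s form well-formed
 * UTF-8 according to Unicode Table 3-7, and 0 otherwise. */
int utf8_wellformed(const unsigned char *s, size_t n)
{
    size_t i = 0;
    while (i < n) {
        unsigned char b = s[i++];
        unsigned char lo = 0x80, hi = 0xBF; /* range allowed for the 2nd byte */
        int trail;                          /* number of continuation bytes */

        if (b <= 0x7F) continue;                          /* ASCII */
        else if (b >= 0xC2 && b <= 0xDF) trail = 1;
        else if (b == 0xE0) { trail = 2; lo = 0xA0; }     /* no overlong forms */
        else if (b >= 0xE1 && b <= 0xEC) trail = 2;
        else if (b == 0xED) { trail = 2; hi = 0x9F; }     /* no surrogates */
        else if (b >= 0xEE && b <= 0xEF) trail = 2;
        else if (b == 0xF0) { trail = 3; lo = 0x90; }     /* no overlong forms */
        else if (b >= 0xF1 && b <= 0xF3) trail = 3;
        else if (b == 0xF4) { trail = 3; hi = 0x8F; }     /* at most U+10FFFF */
        else return 0;                 /* C0, C1, F5..FF, or a stray 80..BF */

        for (int k = 0; k < trail; k++) {
            if (i >= n) return 0;      /* truncated sequence */
            unsigned char c = s[i++];
            if (c < lo || c > hi) return 0;
            lo = 0x80; hi = 0xBF;      /* later bytes use the plain range */
        }
    }
    return 1;
}
```

The sequence from the original report, E0 9F 80, is rejected at its
second byte: a lead byte of E0 requires the next byte to be in A0..BF,
since anything lower would be an overlong encoding.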