From: Rich Felker
Newsgroups: gmane.linux.lib.musl.general
Subject: Re: validation of utf-8 strings passed as system call arguments
Date: Fri, 13 Dec 2013 01:49:23 -0500
Message-ID: <20131213064923.GC24286@brightrain.aerifal.cx>
In-Reply-To: <20131213063651.GH1685@port70.net>
To: musl@lists.openwall.com

On Fri, Dec 13, 2013 at 07:36:51AM +0100, Szabolcs Nagy wrote:
> * Rich Felker [2013-12-12 23:39:41 -0500]:
> > that filenames can contain arbitrary byte sequences.
> > And Linus in particular is opposed to changing this, though
> > there's been some indication (I don't have references right
> > off) that he might be open to optional restrictions at the
> > kernel level.
>
> he didn't look very persuadable some time ago
> http://yarchive.net/comp/linux/utf8.html

Yes, that was a long time ago though. I forget where I saw an
indication that this could change (perhaps the Austin Group list? in
the thread about newlines...) but the general idea, if I recall, was
that restrictions would take place in the framework of a generic
layer, not specific to UTF-8, for restricting malicious content in
filenames.

> (i actually like the kernel that way: what would you do when
> mounting a filesystem with invalid filenames? would you also
> reject surrogate pairs, pua codes or do unicode normalization?)

"Surrogate pairs" aren't even a question; surrogates aren't encodable
at all in UTF-8. So they would automatically be gone just by mandating
well-formed UTF-8.

Normalization (which Apple does) is absolutely wrong and
non-conforming to POSIX; it causes multiple distinct names to refer to
the same file (despite having a link count of 1, BTW), which is just
as dangerous as issues like "over-long sequence" decoding and
URL-escaped dots and slashes. The only "correct" way to do
normalization at the FS level is disallowing non-normalized filenames.
But normalization is actually just broken and harmful anyway, since
there are languages for which bugs in Unicode have made the normalized
form contrary to the actual semantic ordering of characters in the
language. Characters were incorrectly assigned combining classes such
that letters reorder contrary to their actual semantic order, and due
to the stability policy this can't be fixed, so the only solution is
to forget about using normalization.

As for PUA, it wouldn't be forbidden by enforcing UTF-8.
Per the definition, a "UTF" is a bijective mapping between the Unicode
scalar values (0 through 0xD7FF and 0xE000 through 0x10FFFF) and legal
sequences of code units. Whether a character identity is assigned to a
scalar value is irrelevant to UTFs.

Rich
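
To make the points above about surrogates, over-long sequences and PUA
concrete, here is a minimal sketch in C of what "mandating well-formed
UTF-8" could check. The function name utf8_is_well_formed is
hypothetical; it is not a musl or kernel interface, and nothing in
this thread proposes this exact code. It decodes each sequence and
accepts it only if the result is a Unicode scalar value encoded in its
shortest form, so surrogates and over-long forms fall out
automatically while PUA code points pass.

```c
#include <stddef.h>

/* Hypothetical helper, for illustration only (not a musl or kernel API).
 * Returns 1 if the n bytes at s are well-formed UTF-8, 0 otherwise. */
int utf8_is_well_formed(const unsigned char *s, size_t n)
{
    size_t i = 0;

    while (i < n) {
        unsigned char b = s[i];
        unsigned long cp, min;
        size_t len, k;

        if (b < 0x80) {                 /* single-byte (ASCII) */
            i++;
            continue;
        } else if ((b & 0xE0) == 0xC0) {
            len = 2; cp = b & 0x1F; min = 0x80;
        } else if ((b & 0xF0) == 0xE0) {
            len = 3; cp = b & 0x0F; min = 0x800;
        } else if ((b & 0xF8) == 0xF0) {
            len = 4; cp = b & 0x07; min = 0x10000;
        } else {
            return 0;                   /* stray continuation or bad lead byte */
        }

        if (n - i < len)
            return 0;                   /* sequence truncated */

        for (k = 1; k < len; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;               /* not a continuation byte */
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }

        if (cp < min)
            return 0;                   /* over-long encoding */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                   /* surrogate: not a scalar value */
        if (cp > 0x10FFFF)
            return 0;                   /* beyond the Unicode code space */

        i += len;
    }
    return 1;
}
```

With this sketch, the bytes C0 AF (an over-long "/") and ED A0 80
(U+D800) are rejected, while EE 80 80 (U+E000, a PUA code point) is
accepted, since whether a character is assigned to a scalar value
plays no part in the check.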