From: Rich Felker
To: musl@lists.openwall.com
Subject: Re: validation of utf-8 strings passed as system call arguments
Date: Fri, 13 Dec 2013 14:46:40 -0500
Message-ID: <20131213194640.GE24286@brightrain.aerifal.cx>
In-Reply-To: <20131213115754.dc30d64f61e5ec441c34ffd4f788e58e.3729304b11.wbe@email22.secureserver.net>

On Fri, Dec 13, 2013 at 11:57:54AM -0700, writeonce@midipix.org wrote:
> There's no way to convert between UTF-8 and UTF-16 without
> parsing/decoding the UTF-8, which includes validating it for free if
> your parser is written properly. Failure to validate would lead to all
> sorts of bugs, many of them dangerous, including things like treating
> strings not containing '/', '\', ':', '.', etc. as if they contained
> those characters, resulting in directory escape vulnerabilities.
>
> Absolutely, and this is something that I am checking anyway. But there is
> also the special case where an ill-formed utf-8 byte sequence can still
> result in a valid code point, which can then be safely converted to
> utf-16. These cases, which are generally known as the problem of the "non
> shortest form," pertain to byte sequences that used to be valid before
> Unicode version 3.1, but are now forbidden, hence table 3-7 of the current
> (6.2) standard.

What I was saying is that you don't have this problem if you're
parsing/decoding UTF-8 correctly. And parsing it correctly is not
harder/slower than doing it the way that results in misinterpreting
illegal sequences as "non shortest form" for other characters.

A good treatment of the subject (and near-optimal implementation) is
here:

http://bjoern.hoehrmann.de/utf-8/decoder/dfa/

My implementation in musl is based on the same ideas (UTF-8 decoding
as a state machine rather than complex conditionals) but I reduced the
size of the state from two ints to just one and reduced the size of
the state table significantly by essentially encoding the transitions
and partial character values into the state values.

If you're making UTF-8 to UTF-16 conversions to feed to the Windows
kernel filesystem code, I'd do them at the last possible opportunity
before passing the strings to the kernel, and just generate a fake
error equivalent to "file does not exist" or "invalid filename" if the
conversion encounters any illegal sequences.

Rich
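
For illustration only, here is a minimal C sketch of the strict
conversion discussed above. It is not musl's decoder and not the DFA
from the linked page: it uses plain range checks instead of a state
machine, and the names utf8_decode_strict and utf8_to_utf16_strict are
hypothetical. The point it tries to show is that non-shortest forms,
surrogates, and out-of-range values can be rejected during decoding,
so an overlong encoding of '/' never reaches the UTF-16 output.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: strictly decode one code point from s[0..n).
 * Returns the number of bytes consumed (1..4), or 0 if the sequence is
 * ill-formed or truncated. */
static size_t utf8_decode_strict(const unsigned char *s, size_t n, uint32_t *cp)
{
    size_t len;
    uint32_t c, min;

    if (n == 0) return 0;
    if (s[0] < 0x80) { *cp = s[0]; return 1; }

    if (s[0] >= 0xC2 && s[0] <= 0xDF)      { len = 2; c = s[0] & 0x1F; min = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0)        { len = 3; c = s[0] & 0x0F; min = 0x800; }
    else if (s[0] >= 0xF0 && s[0] <= 0xF4) { len = 4; c = s[0] & 0x07; min = 0x10000; }
    else return 0; /* 0x80..0xC1 and 0xF5..0xFF never start a valid sequence */

    if (n < len) return 0; /* truncated */

    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0; /* not a continuation byte */
        c = (c << 6) | (s[i] & 0x3F);
    }

    if (c < min) return 0;                    /* non-shortest form */
    if (c >= 0xD800 && c <= 0xDFFF) return 0; /* surrogate code point */
    if (c > 0x10FFFF) return 0;               /* beyond the Unicode range */

    *cp = c;
    return len;
}

/* Hypothetical converter: turn n bytes of UTF-8 into UTF-16 code units.
 * Returns the number of units written, or (size_t)-1 if the input is
 * ill-formed or dst (capacity cap units) is too small. */
size_t utf8_to_utf16_strict(const unsigned char *src, size_t n,
                            uint16_t *dst, size_t cap)
{
    size_t out = 0;

    assert(dst != NULL);
    while (n > 0) {
        uint32_t cp;
        size_t k = utf8_decode_strict(src, n, &cp);

        if (k == 0) return (size_t)-1;
        if (cp < 0x10000) {
            if (cap - out < 1) return (size_t)-1;
            dst[out++] = (uint16_t)cp;
        } else {
            if (cap - out < 2) return (size_t)-1;
            cp -= 0x10000; /* split into a high/low surrogate pair */
            dst[out++] = (uint16_t)(0xD800 | (cp >> 10));
            dst[out++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        }
        src += k;
        n -= k;
    }
    return out;
}
```

A caller passing paths to a wide-character filesystem interface could
map a (size_t)-1 return to an error such as ENOENT or EINVAL right
before the call, in line with the suggestion above.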