From: Rich Felker
Newsgroups: gmane.linux.lib.musl.general
Subject: Re: validation of utf-8 strings passed as system call arguments
Date: Fri, 13 Dec 2013 12:28:07 -0500
Message-ID: <20131213172807.GD24286@brightrain.aerifal.cx>
To: musl@lists.openwall.com

On Fri, Dec 13, 2013 at 05:52:35AM -0700, writeonce@midipix.org wrote:
> As always, you are absolutely right:-) but my situation is slightly
> different, though; the input I receive is expected to be in utf-8, but the
> nt kernel only accepts utf-16.
> This means that I need to choose between
> conversion that is based on bit distribution only, which might produce
> ill-formed utf-16 byte sequences, or do all the validation on my end
> despite the minor performance penalty. Since path strings are normally
> only a few hundred bytes long, and given that the nt kernel cannot be
> (easily) debugged from my end, I'm leaning towards the latter option.

There's no way to convert between UTF-8 and UTF-16 without
parsing/decoding the UTF-8, which includes validating it for free if
your parser is written properly. Failure to validate would lead to all
sorts of bugs, many of them dangerous, including things like treating
strings not containing '/', '\', ':', '.', etc. as if they contained
those characters, resulting in directory escape vulnerabilities.

Rich
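
To illustrate the point about validation falling out of decoding, here is a
minimal sketch of a UTF-8 to UTF-16 converter in C. It is not code from this
thread or from musl; the function name utf8_to_utf16 and its return
convention (-1 on any error) are assumptions made for the example. The
checks that reject overlong forms are what stop a byte pair such as C0 AF
from being decoded as '/'.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (name and interface are assumptions for this sketch).
 * Converts n bytes of UTF-8 at src into UTF-16 code units in dst, which has
 * room for cap units. Returns the number of units written, or -1 if the
 * input is not well-formed UTF-8 or dst is too small. */
long utf8_to_utf16(const unsigned char *src, size_t n,
                   uint16_t *dst, size_t cap)
{
    size_t i = 0, o = 0;

    while (i < n) {
        unsigned char b = src[i++];
        uint32_t c, min;
        int need;

        /* Classify the lead byte: payload bits, number of continuation
         * bytes, and the smallest code point this length may encode. */
        if (b < 0x80)                    { c = b;        need = 0; min = 0; }
        else if (b >= 0xC2 && b <= 0xDF) { c = b & 0x1F; need = 1; min = 0x80; }
        else if ((b & 0xF0) == 0xE0)     { c = b & 0x0F; need = 2; min = 0x800; }
        else if (b >= 0xF0 && b <= 0xF4) { c = b & 0x07; need = 3; min = 0x10000; }
        else return -1; /* stray continuation, C0/C1, or F5..FF */

        if ((size_t)need > n - i) return -1; /* truncated sequence */

        while (need--) {
            unsigned char t = src[i++];
            if ((t & 0xC0) != 0x80) return -1; /* not a continuation byte */
            c = (c << 6) | (t & 0x3F);
        }

        if (c < min) return -1;                    /* overlong encoding */
        if (c >= 0xD800 && c <= 0xDFFF) return -1; /* encoded surrogate */
        if (c > 0x10FFFF) return -1;               /* beyond Unicode range */

        if (c < 0x10000) {
            if (o >= cap) return -1;
            dst[o++] = (uint16_t)c;
        } else {
            /* Supplementary plane: emit a surrogate pair. */
            if (cap - o < 2) return -1;
            c -= 0x10000;
            dst[o++] = (uint16_t)(0xD800 | (c >> 10));
            dst[o++] = (uint16_t)(0xDC00 | (c & 0x3FF));
        }
    }
    return (long)o;
}
```

The decoder has to inspect every lead and continuation byte anyway in order
to assemble the code point, so the validity checks add only a few
comparisons per character. A converter that merely shuffled bits would
accept E0 80 AF and emit U+002F, which is the kind of hidden '/' described
above.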