Subject: RE: validation of utf-8 strings passed as system call arguments
Date: Fri, 13 Dec 2013 13:23:14 -0700
Message-ID: <20131213132314.dc30d64f61e5ec441c34ffd4f788e58e.76cc1f0026.wbe@email22.secureserver.net>
Newsgroups: gmane.linux.lib.musl.general
To: musl@lists.openwall.com
Reply-To: musl@lists.openwall.com

On 12/13/2013 02:46 PM, Rich Felker wrote:
> On Fri, Dec 13, 2013 at 11:57:54AM -0700, writeonce@midipix.org
> wrote:
>> There's no way to convert between UTF-8 and UTF-16 without
>> parsing/decoding the UTF-8, which includes validating it for free
>> if your parser is written properly.
>> Failure to validate would lead
>> to all sorts of bugs, many of them dangerous, including things like
>> treating strings not containing '/', '\', ':', '.', etc. as if they
>> contained those characters, resulting in directory escape
>> vulnerabilities.
>>
>> Absolutely, and this is something that I am checking anyway.  But
>> there is also the special case where an ill-formed utf-8 byte
>> sequence can still result in a valid code point, which can then be
>> safely converted to utf-16.  These cases, which are generally known
>> as the problem of the "non shortest form," pertain to byte
>> sequences that used to be valid before Unicode version 3.1, but are
>> now forbidden, hence table 3-7 of the current (6.2) standard.
>
> What I was saying is that you don't have this problem if you're
> parsing/decoding UTF-8 correctly. And parsing it correctly is not
> harder/slower than doing it the way that results in misinterpreting
> illegal sequences as "non shortest form" for other characters. A
> good treatment of the subject (and near-optimal implementation) is
> here:
>
> http://bjoern.hoehrmann.de/utf-8/decoder/dfa/
>
> My implementation in musl is based on the same ideas (UTF-8 decoding
> as a state machine rather than complex conditionals) but I reduced
> the size of the state from two ints to just one and reduced the size
> of the state table significantly by essentially encoding the
> transitions and partial character values into the state values.

Thanks for the tips and reference.  Once everything else is working I'll
certainly switch to a method that follows either your optimization or
Hoehrmann's (which I'll admittedly need more than a few minutes to
understand...)  For the time being I am leaving the set of conditionals
that follows the standard and table 3-7, as that is very easy to
implement.
And with the target strings being relatively short,
hopefully this won't even have any real performance consequences.

>
> If you're making UTF-8 to UTF-16 conversions to feed to the Windows
> kernel filesystem code, I'd do them at the last possible opportunity
> before passing the strings to the kernel, and just generate a fake
> error equivalent to "file does not exist" or "invalid filename" if
> the conversion encounters any illegal sequences.

Indeed, that is exactly how I am doing this.
zg

>
> Rich
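
For illustration only, here is a minimal sketch of the conditional-based approach discussed in this thread: a converter that accepts only the byte ranges listed in table 3-7 and refuses everything else, so that non-shortest forms never reach the UTF-16 side. The function name utf8_to_utf16, its signature and its return convention are assumptions made for this example; this is neither the code from musl nor the code referred to above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (name and signature are assumptions for this sketch).
 * Converts the NUL-terminated UTF-8 string `src` into UTF-16 code units in
 * `dst`, which has room for `cap` units including the terminator.
 * Returns the number of units written (terminator excluded), or -1 if the
 * input is ill-formed per table 3-7 or does not fit. */
long utf8_to_utf16(const unsigned char *src, uint16_t *dst, size_t cap)
{
    size_t n = 0;

    assert(src && dst);
    while (*src) {
        unsigned char b = *src++;
        unsigned char lo = 0x80, hi = 0xBF; /* allowed range of 2nd byte */
        uint32_t cp;
        int trail;                          /* continuation bytes expected */

        if (b <= 0x7F) {
            cp = b; trail = 0;
        } else if (b >= 0xC2 && b <= 0xDF) {
            cp = b & 0x1F; trail = 1;
        } else if (b >= 0xE0 && b <= 0xEF) {
            cp = b & 0x0F; trail = 2;
            if (b == 0xE0) lo = 0xA0;       /* rejects non-shortest forms */
            if (b == 0xED) hi = 0x9F;       /* rejects surrogates */
        } else if (b >= 0xF0 && b <= 0xF4) {
            cp = b & 0x07; trail = 3;
            if (b == 0xF0) lo = 0x90;       /* rejects non-shortest forms */
            if (b == 0xF4) hi = 0x8F;       /* rejects values > U+10FFFF */
        } else {
            return -1;                      /* 80..C1 and F5..FF never lead */
        }

        for (int i = 0; i < trail; i++) {
            unsigned char c = *src;
            if (c < lo || c > hi)           /* also catches a truncated tail */
                return -1;
            src++;
            cp = (cp << 6) | (c & 0x3F);
            lo = 0x80; hi = 0xBF;           /* later bytes use the full range */
        }

        if (cp >= 0x10000) {                /* needs a surrogate pair */
            if (n + 2 >= cap) return -1;
            cp -= 0x10000;
            dst[n++] = (uint16_t)(0xD800 | (cp >> 10));
            dst[n++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        } else {
            if (n + 1 >= cap) return -1;
            dst[n++] = (uint16_t)cp;
        }
    }
    if (n >= cap) return -1;
    dst[n] = 0;
    return (long)n;
}
```

A caller would run this as the last step before handing the string to the kernel and turn a -1 result into an error equivalent to "file does not exist" or "invalid filename", as suggested in the quoted text.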