From mboxrd@z Thu Jan  1 00:00:00 1970
X-Msuck: nntp://news.gmane.org/gmane.linux.lib.musl.general/4399
Path: news.gmane.org!not-for-mail
From: <writeonce@midipix.org>
Newsgroups: gmane.linux.lib.musl.general
Subject: RE: validation of utf-8 strings passed as system call arguments
Date: Fri, 13 Dec 2013 13:23:14 -0700
Message-ID: <20131213132314.dc30d64f61e5ec441c34ffd4f788e58e.76cc1f0026.wbe@email22.secureserver.net>
Reply-To: musl@lists.openwall.com
NNTP-Posting-Host: plane.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
X-Trace: ger.gmane.org 1386966204 5524 80.91.229.3 (13 Dec 2013 20:23:24 GMT)
X-Complaints-To: usenet@ger.gmane.org
NNTP-Posting-Date: Fri, 13 Dec 2013 20:23:24 +0000 (UTC)
To: musl@lists.openwall.com
Original-X-From: musl-return-4403-gllmg-musl=m.gmane.org@lists.openwall.com Fri Dec 13 21:23:31 2013
Return-path: <musl-return-4403-gllmg-musl=m.gmane.org@lists.openwall.com>
Envelope-to: gllmg-musl@plane.gmane.org
Original-Received: from mother.openwall.net ([195.42.179.200])
	by plane.gmane.org with smtp (Exim 4.69)
	(envelope-from <musl-return-4403-gllmg-musl=m.gmane.org@lists.openwall.com>)
	id 1VrZGk-0002B8-0m
	for gllmg-musl@plane.gmane.org; Fri, 13 Dec 2013 21:23:30 +0100
Original-Received: (qmail 21973 invoked by uid 550); 13 Dec 2013 20:23:29 -0000
Mailing-List: contact musl-help@lists.openwall.com; run by ezmlm
Precedence: bulk
List-Post: <mailto:musl@lists.openwall.com>
List-Help: <mailto:musl-help@lists.openwall.com>
List-Unsubscribe: <mailto:musl-unsubscribe@lists.openwall.com>
List-Subscribe: <mailto:musl-subscribe@lists.openwall.com>
Original-Received: (qmail 21965 invoked from network); 13 Dec 2013 20:23:28 -0000
X-SID: 18PF1n0012XSfNk01
X-Originating-IP: 128.143.141.78
User-Agent: Workspace Webmail 5.6.45
Xref: news.gmane.org gmane.linux.lib.musl.general:4399
Archived-At: <http://permalink.gmane.org/gmane.linux.lib.musl.general/4399>

On 12/13/2013 02:46 PM, Rich Felker wrote:
> On Fri, Dec 13, 2013 at 11:57:54AM -0700, writeonce@midipix.org
> wrote:
>> There's no way to convert between UTF-8 and UTF-16 without
>> parsing/decoding the UTF-8, which includes validating it for free
>> if your parser is written properly. Failure to validate would lead
>> to all sorts of bugs, many of them dangerous, including things like
>> treating strings not containing '/', '\', ':', '.', etc. as if they
>> contained those characters, resulting in directory escape
>> vulnerabilities.
>>
>> Absolutely, and this is something that I am checking anyway.  But
>> there is also the special case where an ill-formed utf-8 byte
>> sequence can still result in a valid code point, which can then be
>> safely converted to utf-16.  These cases, which are generally known
>> as the problem of the "non shortest form," pertain to byte
>> sequences that used to be valid before Unicode version 3.1, but are
>> now forbidden, hence table 3-7 of the current (6.2) standard.
>
> What I was saying is that you don't have this problem if you're
> parsing/decoding UTF-8 correctly. And parsing it correctly is not
> harder/slower than doing it the way that results in misinterpreting
> illegal sequences as "non shortest form" for other characters. A
> good treatment of the subject (and near-optimal implementation) is
> here:
>
> http://bjoern.hoehrmann.de/utf-8/decoder/dfa/
>
> My implementation in musl is based on the same ideas (UTF-8 decoding
> as a state machine rather than complex conditionals) but I reduced
> the size of the state from two ints to just one and reduced the size
> of the state table significantly by essentially encoding the
> transitions and partial character values into the state values.

Thanks for the tips and reference.  Once everything else is working I'll
certainly switch to a method that follows either yours or Hoehrmann's
optimization (which I'll admittedly need more than a few minutes to
understand...).  For the time being I am leaving the set of conditionals
that follows the standard and table 3-7, as that is very easy to
implement.  And with the target strings being relatively short,
hopefully this won't have any real performance consequences.

>
> If you're making UTF-8 to UTF-16 conversions to feed to the Windows
> kernel filesystem code, I'd do them at the last possible opportunity
> before passing the strings to the kernel, and just generate a fake
> error equivalent to "file does not exist" or "invalid filename" if
> the conversion encounters any illegal sequences.

Indeed, that is exactly how I am doing this.
zg

>
> Rich
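
For illustration, the "set of conditionals" approach discussed in this
thread could look roughly like the sketch below.  It is not the code
from musl or midipix; the function name utf8_to_utf16 and its signature
are assumptions made up for this example.  It checks each sequence
against the byte ranges of table 3-7, so non-shortest forms and
surrogates are rejected rather than decoded, and it returns -1 on any
ill-formed input.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (name and signature are assumptions).
 * Converts a NUL-terminated UTF-8 string to NUL-terminated UTF-16,
 * validating against the byte ranges of Unicode table 3-7.
 * Returns the number of UTF-16 code units written, not counting the
 * terminator, or -1 on ill-formed input or insufficient space. */
long utf8_to_utf16(const unsigned char *src, uint16_t *dst, size_t cap)
{
    size_t n = 0;

    while (*src) {
        unsigned char b = *src++;
        uint32_t cp;
        int trail;                      /* continuation bytes expected */
        unsigned lo = 0x80, hi = 0xBF;  /* allowed range of 2nd byte */

        if (b <= 0x7F) {
            cp = b; trail = 0;
        } else if (b < 0xC2) {
            return -1;   /* stray continuation byte, or overlong C0/C1 */
        } else if (b <= 0xDF) {
            cp = b & 0x1F; trail = 1;
        } else if (b <= 0xEF) {
            cp = b & 0x0F; trail = 2;
            if (b == 0xE0) lo = 0xA0;   /* no overlong 3-byte forms */
            if (b == 0xED) hi = 0x9F;   /* no surrogates D800..DFFF */
        } else if (b <= 0xF4) {
            cp = b & 0x07; trail = 3;
            if (b == 0xF0) lo = 0x90;   /* no overlong 4-byte forms */
            if (b == 0xF4) hi = 0x8F;   /* nothing above U+10FFFF */
        } else {
            return -1;   /* F5..FF never appear in UTF-8 */
        }

        for (int i = 0; i < trail; i++) {
            unsigned char c = *src;
            if (c < lo || c > hi)
                return -1;   /* also catches a truncated sequence */
            src++;
            cp = (cp << 6) | (c & 0x3F);
            lo = 0x80; hi = 0xBF;   /* later bytes use the full range */
        }

        if (cp < 0x10000) {
            if (n + 1 >= cap) return -1;   /* keep room for the NUL */
            dst[n++] = (uint16_t)cp;
        } else {
            if (n + 2 >= cap) return -1;
            cp -= 0x10000;
            dst[n++] = (uint16_t)(0xD800 | (cp >> 10));
            dst[n++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        }
    }

    if (n >= cap) return -1;
    dst[n] = 0;
    return (long)n;
}
```

A caller following the advice above would run this immediately before
handing the path to the kernel and map a -1 result to an "invalid
filename" style error.  For example, the overlong encoding of '/'
(bytes C0 AF) is rejected instead of being turned into a path
separator.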