From mboxrd@z Thu Jan 1 00:00:00 1970
Newsgroups: gmane.linux.lib.musl.general
Subject: RE: validation of utf-8 strings passed as system call arguments
Date: Fri, 13 Dec 2013 11:57:54 -0700
Message-ID: <20131213115754.dc30d64f61e5ec441c34ffd4f788e58e.3729304b11.wbe@email22.secureserver.net>
Reply-To: musl@lists.openwall.com
To: musl@lists.openwall.com

On 12/13/2013 12:28 PM, Rich Felker wrote:
> On Fri, Dec 13, 2013 at 05:52:35AM -0700, writeonce@midipix.org wrote:
>> As always, you are absolutely right:-)  but my situation is slightly
>> different, though; the input I receive is expected to be in utf-8, but the
>> nt kernel only accepts utf-16.  This means that I need to choose between
>> conversion that is based on bit distribution only, which might produce
>> ill-formed utf-16 byte sequences, or do all the validation on my end
>> despite the minor performance penalty.  Since path strings are normally
>> only a few hundred bytes long, and given that the nt kernel cannot be
>> (easily) debugged from my end, I'm leaning towards the latter option.
>
> There's no way to convert between UTF-8 and UTF-16 without
> parsing/decoding the UTF-8, which includes validating it for free if
> your parser is written properly. Failure to validate would lead to all
> sorts of bugs, many of them dangerous, including things like treating
> strings not containing '/', '\', ':', '.', etc. as if they contained
> those characters, resulting in directory escape vulnerabilities.

Absolutely, and this is something that I am checking anyway.  But
there is also the special case where an ill-formed utf-8 byte sequence
can still decode to a valid code point, which can then be safely
converted to utf-16.  These cases, generally known as the "non-shortest
form" problem, pertain to byte sequences that were accepted before
Unicode version 3.1 but are now forbidden; hence table 3-7
("Well-Formed UTF-8 Byte Sequences") of the current (6.2) standard.
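
To make the non-shortest-form case concrete: the two-byte sequence
C0 AF carries the bits of U+002F ('/'), so a purely bit-based
conversion would yield a '/' that never appeared as a literal byte in
the input.  Below is a minimal sketch in C, not taken from any existing
code base, of a decoder that follows the byte ranges of table 3-7 and
only then emits utf-16.  The names utf8_decode and utf8_to_utf16 are
hypothetical, and returning -1 on any ill-formed input is an assumption
about the desired error policy.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one code point from s[0..n).
 * Returns the number of bytes consumed, or 0 if the sequence is
 * ill-formed according to the byte ranges of Unicode table 3-7. */
static size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    unsigned char b0, lo = 0x80, hi = 0xBF; /* allowed range of 2nd byte */
    size_t len, i;
    uint32_t c;

    if (n == 0) return 0;
    b0 = s[0];
    if (b0 < 0x80) { *cp = b0; return 1; }

    if (b0 >= 0xC2 && b0 <= 0xDF) {
        len = 2; c = b0 & 0x1F;          /* C0 and C1 are never valid */
    } else if (b0 >= 0xE0 && b0 <= 0xEF) {
        len = 3; c = b0 & 0x0F;
        if (b0 == 0xE0) lo = 0xA0;       /* reject non-shortest 3-byte forms */
        if (b0 == 0xED) hi = 0x9F;       /* reject surrogates D800..DFFF */
    } else if (b0 >= 0xF0 && b0 <= 0xF4) {
        len = 4; c = b0 & 0x07;
        if (b0 == 0xF0) lo = 0x90;       /* reject non-shortest 4-byte forms */
        if (b0 == 0xF4) hi = 0x8F;       /* reject values above U+10FFFF */
    } else {
        return 0;                        /* stray continuation, C0, C1, F5..FF */
    }

    if (n < len) return 0;               /* truncated sequence */
    if (s[1] < lo || s[1] > hi) return 0;
    c = (c << 6) | (s[1] & 0x3F);
    for (i = 2; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;
        c = (c << 6) | (s[i] & 0x3F);
    }
    *cp = c;
    return len;
}

/* Hypothetical converter: returns the number of utf-16 code units
 * written to dst, or -1 on ill-formed input or insufficient space. */
static long utf8_to_utf16(const unsigned char *src, size_t n,
                          uint16_t *dst, size_t cap)
{
    size_t out = 0;

    while (n > 0) {
        uint32_t cp;
        size_t k = utf8_decode(src, n, &cp);

        if (k == 0) return -1;
        src += k;
        n -= k;
        if (cp < 0x10000) {
            if (cap - out < 1) return -1;
            dst[out++] = (uint16_t)cp;
        } else {
            if (cap - out < 2) return -1;
            cp -= 0x10000;               /* split into a surrogate pair */
            dst[out++] = (uint16_t)(0xD800 | (cp >> 10));
            dst[out++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        }
    }
    return (long)out;
}
```

With this sketch, the bytes C0 AF and E0 80 AF are both rejected, while
2F C3 A9 converts to the two code units 002F 00E9.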

zg

> Rich