From mboxrd@z Thu Jan 1 00:00:00 1970
X-Msuck: nntp://news.gmane.org/gmane.linux.lib.musl.general/4390
From: writeonce@midipix.org
Newsgroups: gmane.linux.lib.musl.general
Subject: validation of utf-8 strings passed as system call arguments
Date: Thu, 12 Dec 2013 21:30:06 -0700
Message-ID: <20131212213006.dc30d64f61e5ec441c34ffd4f788e58e.381c744cf1.wbe@email22.secureserver.net>
Reply-To: musl@lists.openwall.com
To: musl@lists.openwall.com

Hello,

While working on code that converts arguments from utf-16 to utf-8, I found myself wondering about the "responsibility" for checking well-formedness of utf-8 strings that are passed to the kernel. As I suspected, validation of these strings takes place neither in the kernel, nor in the C library. The attached program demonstrates this by creating a file named <0xE0 0x9F 0x80>, which according to the Unicode Standard (6.2, p. 95) is an ill-formed byte sequence.

I am not sure whether this can officially be considered a bug, and it is quite clear that fixing this is going to entail some performance penalty. That being said, after deleting this file from my Ubuntu desktop most (but not all) attempts to open the Trash folder made Nautilus crash, and it was only after deleting the file permanently from the shell that order had been restored...

Best regards,
zg

[Attachment: open__ill_formed_utf8.c]

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/types.h>

int main (int argc, char * argv[], char * envp[])
{
	char path[] = {0xE0, 0x9F, 0x80, 0x00};
	mode_t mode = S_IRUSR | S_IWUSR | S_IRGRP | S_IWGRP | S_IROTH;

	int fd = open (path, O_WRONLY | O_EXCL | O_CREAT, mode);

	if (fd == -1) {
		perror ("open");
		return 2;
	} else {
		printf("It worked! The file descriptor is %d.\n",fd);
	}

	return 0;
}

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Rich Felker
Subject: Re: validation of utf-8 strings passed as system call arguments
Date: Thu, 12 Dec 2013 23:39:41 -0500
Message-ID: <20131213043941.GA24286@brightrain.aerifal.cx>

On Thu, Dec 12, 2013 at 09:30:06PM -0700, writeonce@midipix.org wrote:
> Hello,
>
> While working on code that converts arguments from utf-16 to utf-8, I
> found myself wondering about the "responsibility" for checking
> well-formedness of utf-8 strings that are passed to the kernel. As I
> suspected, validation of these strings takes place neither in the kernel,
> nor in the C library. The attached program demonstrates this by creating
> a file named <0xE0 0x9F 0x80>, which according to the Unicode Standard
> (6.2, p. 95) is an ill-formed byte sequence.
>
> I am not sure whether this can officially be considered a bug, and it is
> quite clear that fixing this is going to entail some performance penalty.
> That being said, after deleting this file from my Ubuntu desktop most (but
> not all) attempts to open the Trash folder made Nautilus crash, and it was
> only after deleting the file permanently from the shell that order had
> been restored...

There's nothing in POSIX that says that filenames have to be valid
strings in the current locale's encoding -- in fact, this is highly
problematic to enforce on implementations other than musl, such as
glibc, where the encoding might vary by locale and where different
users might be using locales with different encodings.

But there's also nothing that says arbitrary byte sequences (excluding
of course those containing '/' and NUL) have to be accepted as
filenames either.
The historical _expectation_ and practice has been
that filenames can contain arbitrary byte sequences. And Linus in
particular is opposed to changing this, though there's been some
indication (I don't have references right off) that he might be open
to optional restrictions at the kernel level.

What's clear to me is that restrictions at the libc level are not
useful. If your concern is that creating files with illegal sequences
in their names can confuse/break/crash some software, adding a
restriction on file creation in libc won't help. A malicious user can
just make the syscalls directly to make malicious filenames. On the
other hand, having the restriction in libc would be annoying because
it would _prevent_ you from renaming or deleting these bad filenames
using standard tools; you'd have to use special tools that make the
syscalls directly.

So if you want protection against illegal sequences in filenames
(personally, I want this too) the right place to lobby for it (and
propose an optional feature) is in the kernel, not in libc.

Rich

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Szabolcs Nagy
Subject: Re: validation of utf-8 strings passed as system call arguments
Date: Fri, 13 Dec 2013 07:36:51 +0100
Message-ID: <20131213063651.GH1685@port70.net>

* Rich Felker [2013-12-12 23:39:41 -0500]:
> that filenames can contain arbitrary byte sequences. And Linus in
> particular is opposed to changing this, though there's been some
> indication (I don't have references right off) that he might be open
> to optional restrictions at the kernel level.

he didn't look very persuadable some time ago
http://yarchive.net/comp/linux/utf8.html

(i actually like the kernel that way: what would you do when
mounting a filesystem with invalid filenames? would you also
reject surrogate pairs, pua codes or do unicode normalization?)

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Rich Felker
Subject: Re: validation of utf-8 strings passed as system call arguments
Date: Fri, 13 Dec 2013 01:49:23 -0500
Message-ID: <20131213064923.GC24286@brightrain.aerifal.cx>

On Fri, Dec 13, 2013 at 07:36:51AM +0100, Szabolcs Nagy wrote:
> * Rich Felker [2013-12-12 23:39:41 -0500]:
> > that filenames can contain arbitrary byte sequences. And Linus in
> > particular is opposed to changing this, though there's been some
> > indication (I don't have references right off) that he might be open
> > to optional restrictions at the kernel level.
>
> he didn't look very persuadable some time ago
> http://yarchive.net/comp/linux/utf8.html

Yes, that was a long time ago though. I forget where I saw an
indication that this could change (perhaps the Austin Group list? in
the thread about newlines...) but the general idea, if I recall, was
that restrictions would take place in the framework of a generic layer
for restricting malicious content in filenames that's not UTF-8
specific.

> (i actually like the kernel that way: what would you do when
> mounting a filesystem with invalid filenames? would you also
> reject surrogate pairs, pua codes or do unicode normalization?)

"Surrogate pairs" aren't even a question; surrogates aren't encodable
at all in UTF-8. So they would automatically be gone just by mandating
well-formed UTF-8.

Normalization (which Apple does) is absolutely wrong and
non-conforming to POSIX; it causes multiple distinct names to refer to
the same file (despite having a link count of 1, BTW), which is just
as dangerous as issues like "over-long sequence" decoding and
URL-escaped dots and slashes. The only "correct" way to do
normalization at the FS level is disallowing non-normalized filenames.
But normalization is actually just broken and harmful anyway, since
there are languages for which bugs in Unicode have made the normalized
form contrary to the actual semantic ordering of characters in the
language (characters were incorrectly assigned combining classes such
that letters reorder contrary to their actual semantic order, and due
to stability policy this can't be fixed, so the only solution is to
forget about using normalization).

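The remark that surrogates are not encodable in UTF-8 can be illustrated with a small encoder. This is only a minimal sketch and is not taken from musl or from any message in this thread; the function name utf8_encode and its return convention are hypothetical. It refuses anything that is not a Unicode scalar value, so a surrogate such as U+D800 simply has no byte sequence.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point as UTF-8 into out[4].
 * Returns the number of bytes written, or 0 if cp is not a Unicode
 * scalar value (a surrogate, or a value above 0x10FFFF). */
static size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                       /* surrogates have no UTF-8 form */
    if (cp > 0x10FFFF)
        return 0;                       /* beyond the Unicode code space */
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        /* Private-use code points such as U+E000 are ordinary scalar
         * values and are encoded like any other. */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}
```

With this sketch, utf8_encode(0xD800, buf) returns 0, while utf8_encode(0xE000, buf) writes the three bytes EE 80 80.
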
As for PUA, it wouldn't be forbidden by enforcing UTF-8. Per the
definition, a "UTF" is a bijective mapping between the Unicode scalar
values (0 through 0xD7FF and 0xE000 through 0x10FFFF) and legal
sequences of code units. Whether a character identity is assigned to a
scalar value is irrelevant to UTFs.

Rich

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Luca Barbato
Subject: Re: validation of utf-8 strings passed as system call arguments
Date: Fri, 13 Dec 2013 13:11:38 +0100
Message-ID: <52AAF97A.1090505@gentoo.org>

On 13/12/13 05:30, writeonce@midipix.org wrote:
> Hello,
>
> While working on code that converts arguments from utf-16 to utf-8, I found
> myself wondering about the "responsibility" for checking well-formedness of
> utf-8 strings that are passed to the kernel. As I suspected, validation of
> these strings takes place neither in the kernel, nor in the C library. The
> attached program demonstrates this by creating a file named <0xE0 0x9F 0x80>,
> which according to the Unicode Standard (6.2, p. 95) is an ill-formed byte sequence.
>
> I am not sure whether this can officially be considered a bug, and it is quite
> clear that fixing this is going to entail some performance penalty. That being
> said, after deleting this file from my Ubuntu desktop most (but not all)
> attempts to open the Trash folder made Nautilus crash, and it was only after
> deleting the file permanently from the shell that order had been restored...
>

Any kind of rejection besides NUL and the path separator seems to me
more harmful, and even more dangerous, than the status quo.

lu

From mboxrd@z Thu Jan 1 00:00:00 1970
From: writeonce@midipix.org
Subject: RE: validation of utf-8 strings passed as system call arguments
Date: Fri, 13 Dec 2013 05:52:35 -0700
Message-ID: <20131213055235.dc30d64f61e5ec441c34ffd4f788e58e.563168f7a8.wbe@email22.secureserver.net>

On 12/12/2013 11:39 PM, Rich Felker wrote:
> On Thu, Dec 12, 2013 at 09:30:06PM -0700, writeonce@midipix.org wrote:
>> Hello,
>>
>> While working on code that converts arguments from utf-16 to utf-8, I
>> found myself wondering about the "responsibility" for checking
>> well-formedness of utf-8 strings that are passed to the kernel. As I
>> suspected, validation of these strings takes place neither in the kernel,
>> nor in the C library. The attached program demonstrates this by creating
>> a file named <0xE0 0x9F 0x80>, which according to the Unicode Standard
>> (6.2, p. 95) is an ill-formed byte sequence.
>>
>> I am not sure whether this can officially be considered a bug, and it is
>> quite clear that fixing this is going to entail some performance penalty.
>> That being said, after deleting this file from my Ubuntu desktop most (but
>> not all) attempts to open the Trash folder made Nautilus crash, and it was
>> only after deleting the file permanently from the shell that order had
>> been restored...
>
> There's nothing in POSIX that says that filenames have to be valid
> strings in the current locale's encoding -- in fact, this is highly
> problematic to enforce on implementations other than musl, such as
> glibc, where the encoding might vary by locale and where different
> users might be using locales with different encodings.
>
> But there's also nothing that says arbitrary byte sequences (excluding
> of course those containing '/' and NUL) have to be accepted as
> filenames either. The historical _expectation_ and practice has been
> that filenames can contain arbitrary byte sequences. And Linus in
> particular is opposed to changing this, though there's been some
> indication (I don't have references right off) that he might be open
> to optional restrictions at the kernel level.
>
> What's clear to me is that restrictions at the libc level are not
> useful. If your concern is that creating files with illegal sequences
> in their names can confuse/break/crash some software, adding a
> restriction on file creation in libc won't help. A malicious user can
> just make the syscalls directly to make malicious filenames. On the
> other hand, having the restriction in libc would be annoying because
> it would _prevent_ you from renaming or deleting these bad filenames
> using standard tools; you'd have to use special tools that make the
> syscalls directly.

As always, you are absolutely right:-) but my situation is slightly
different, though; the input I receive is expected to be in utf-8, but
the nt kernel only accepts utf-16. This means that I need to choose
between conversion that is based on bit distribution only, which might
produce ill-formed utf-16 byte sequences, or do all the validation on
my end despite the minor performance penalty. Since path strings are
normally only a few hundred bytes long, and given that the nt kernel
cannot be (easily) debugged from my end, I'm leaning towards the latter
option.

> So if you want protection against illegal sequences in filenames
> (personally, I want this too) the right place to lobby for it (and
> propose an optional feature) is in the kernel, not in libc.
>
> Rich

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Rich Felker
Subject: Re: validation of utf-8 strings passed as system call arguments
Date: Fri, 13 Dec 2013 12:28:07 -0500
Message-ID: <20131213172807.GD24286@brightrain.aerifal.cx>

On Fri, Dec 13, 2013 at 05:52:35AM -0700, writeonce@midipix.org wrote:
> As always, you are absolutely right:-) but my situation is slightly
> different, though; the input I receive is expected to be in utf-8, but the
> nt kernel only accepts utf-16. This means that I need to choose between
> conversion that is based on bit distribution only, which might produce
> ill-formed utf-16 byte sequences, or do all the validation on my end
> despite the minor performance penalty. Since path strings are normally
> only a few hundred bytes long, and given that the nt kernel cannot be
> (easily) debugged from my end, I'm leaning towards the latter option.

There's no way to convert between UTF-8 and UTF-16 without
parsing/decoding the UTF-8, which includes validating it for free if
your parser is written properly. Failure to validate would lead to all
sorts of bugs, many of them dangerous, including things like treating
strings not containing '/', '\', ':', '.', etc. as if they contained
those characters, resulting in directory escape vulnerabilities.

Rich

From mboxrd@z Thu Jan 1 00:00:00 1970
From: writeonce@midipix.org
Subject: RE: validation of utf-8 strings passed as system call arguments
Date: Fri, 13 Dec 2013 11:57:54 -0700
Message-ID: <20131213115754.dc30d64f61e5ec441c34ffd4f788e58e.3729304b11.wbe@email22.secureserver.net>

On 12/13/2013 12:28 PM, Rich Felker wrote:
> On Fri, Dec 13, 2013 at 05:52:35AM -0700, writeonce@midipix.org wrote:
>> As always, you are absolutely right:-) but my situation is slightly
>> different, though; the input I receive is expected to be in utf-8, but the
>> nt kernel only accepts utf-16. This means that I need to choose between
>> conversion that is based on bit distribution only, which might produce
>> ill-formed utf-16 byte sequences, or do all the validation on my end
>> despite the minor performance penalty. Since path strings are normally
>> only a few hundred bytes long, and given that the nt kernel cannot be
>> (easily) debugged from my end, I'm leaning towards the latter option.
>
> There's no way to convert between UTF-8 and UTF-16 without
> parsing/decoding the UTF-8, which includes validating it for free if
> your parser is written properly. Failure to validate would lead to all
> sorts of bugs, many of them dangerous, including things like treating
> strings not containing '/', '\', ':', '.', etc. as if they contained
> those characters, resulting in directory escape vulnerabilities.

Absolutely, and this is something that I am checking anyway. But
there is also the special case where an ill-formed utf-8 byte sequence
can still result in a valid code point, which can then be safely
converted to utf-16. These cases, which are generally known as the
problem of the "non shortest form," pertain to byte sequences that used
to be valid before Unicode version 3.1, but are now forbidden, hence
table 3-7 of the current (6.2) standard.

zg

> Rich

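To make the "non shortest form" case concrete, the following sketch compares a decode that only redistributes bits with one that checks the lead byte first. It is an illustration under stated assumptions, not code from the thread: the names naive_decode2 and strict_decode2 are hypothetical and only two-byte sequences are handled. The byte pair C0 AF is the classic over-long spelling of '/' (U+002F).

```c
#include <assert.h>
#include <stdint.h>

/* Naive "bit distribution" decode of a two-byte sequence: it shuffles
 * the payload bits together and never asks whether the encoding is the
 * shortest possible one. */
static uint32_t naive_decode2(unsigned char b0, unsigned char b1)
{
    return ((uint32_t)(b0 & 0x1F) << 6) | (uint32_t)(b1 & 0x3F);
}

/* Strict decode of a two-byte sequence. Returns the code point, or -1
 * if the pair is ill-formed. */
static int32_t strict_decode2(unsigned char b0, unsigned char b1)
{
    int32_t cp;

    if (b0 < 0xC2 || b0 > 0xDF)
        return -1;      /* C0 and C1 could only encode values below 0x80 */
    if ((b1 & 0xC0) != 0x80)
        return -1;      /* second byte must be a continuation byte */
    cp = ((int32_t)(b0 & 0x1F) << 6) | (int32_t)(b1 & 0x3F);
    assert(cp >= 0x80); /* guaranteed by the lead-byte check above */
    return cp;
}
```

Here naive_decode2(0xC0, 0xAF) yields 0x2F, that is '/', from a string that contains no '/' byte, whereas strict_decode2(0xC0, 0xAF) reports an error.
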
From mboxrd@z Thu Jan 1 00:00:00 1970
From: Rich Felker
Subject: Re: validation of utf-8 strings passed as system call arguments
Date: Fri, 13 Dec 2013 14:46:40 -0500
Message-ID: <20131213194640.GE24286@brightrain.aerifal.cx>

On Fri, Dec 13, 2013 at 11:57:54AM -0700, writeonce@midipix.org wrote:
> > There's no way to convert between UTF-8 and UTF-16 without
> > parsing/decoding the UTF-8, which includes validating it for free if
> > your parser is written properly. Failure to validate would lead to all
> > sorts of bugs, many of them dangerous, including things like treating
> > strings not containing '/', '\', ':', '.', etc. as if they contained
> > those characters, resulting in directory escape vulnerabilities.
>
> Absolutely, and this is something that I am checking anyway. But there is
> also the special case where an ill-formed utf-8 byte sequence can still
> result in a valid code point, which can then be safely converted to
> utf-16. These cases, which are generally known as the problem of the "non
> shortest form," pertain to byte sequences that used to be valid before
> Unicode version 3.1, but are now forbidden, hence table 3-7 of the current
> (6.2) standard.

What I was saying is that you don't have this problem if you're
parsing/decoding UTF-8 correctly. And parsing it correctly is not
harder/slower than doing it the way that results in misinterpreting
illegal sequences as "non shortest form" for other characters. A
good treatment of the subject (and near-optimal implementation) is
here:

http://bjoern.hoehrmann.de/utf-8/decoder/dfa/

My implementation in musl is based on the same ideas (UTF-8 decoding
as a state machine rather than complex conditionals) but I reduced
the size of the state from two ints to just one and reduced the size
of the state table significantly by essentially encoding the
transitions and partial character values into the state values.

If you're making UTF-8 to UTF-16 conversions to feed to the Windows
kernel filesystem code, I'd do them at the last possible opportunity
before passing the strings to the kernel, and just generate a fake
error equivalent to "file does not exist" or "invalid filename" if
the conversion encounters any illegal sequences.

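The approach described above (convert at the last moment and fail on any illegal sequence) might look roughly like the sketch below. It is not the state-machine decoder from the linked page, nor musl's implementation. It is a plain conditional decoder written for illustration, and the name utf8_to_utf16 and its error codes are hypothetical. Validation falls out of decoding: each decoded value is checked against the minimum for its length, the surrogate range, and the 0x10FFFF limit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical converter. Returns the number of UTF-16 code units
 * written to dst, -1 if src is ill-formed UTF-8, or -2 if dst is too
 * small. A caller could map -1 to an "invalid filename" style error. */
static long utf8_to_utf16(const unsigned char *src, size_t n,
                          uint16_t *dst, size_t cap)
{
    /* Smallest code point that legitimately needs 1, 2, 3, 4 bytes. */
    static const uint32_t min_cp[4] = { 0, 0x80, 0x800, 0x10000 };
    size_t i = 0, o = 0;

    assert(src != NULL && dst != NULL);
    while (i < n) {
        unsigned char b = src[i++];
        uint32_t cp;
        int extra, k;

        if (b < 0x80)      { cp = b;        extra = 0; }
        else if (b < 0xC0) return -1;            /* stray continuation */
        else if (b < 0xE0) { cp = b & 0x1F; extra = 1; }
        else if (b < 0xF0) { cp = b & 0x0F; extra = 2; }
        else if (b < 0xF8) { cp = b & 0x07; extra = 3; }
        else               return -1;            /* F8..FF never valid */

        if ((size_t)extra > n - i)
            return -1;                           /* truncated sequence */
        for (k = 0; k < extra; k++) {
            unsigned char c = src[i++];
            if ((c & 0xC0) != 0x80)
                return -1;
            cp = (cp << 6) | (uint32_t)(c & 0x3F);
        }
        if (cp < min_cp[extra]) return -1;       /* non-shortest form */
        if (cp >= 0xD800 && cp <= 0xDFFF) return -1; /* surrogate */
        if (cp > 0x10FFFF) return -1;            /* out of range */

        if (cp < 0x10000) {
            if (o + 1 > cap) return -2;
            dst[o++] = (uint16_t)cp;
        } else {
            if (o + 2 > cap) return -2;
            cp -= 0x10000;                       /* split into a pair */
            dst[o++] = (uint16_t)(0xD800 | (cp >> 10));
            dst[o++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        }
    }
    return (long)o;
}
```

With this sketch, the name from the first message, E0 9F 80, is rejected with -1 before anything would reach the kernel.
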
Rich

From mboxrd@z Thu Jan 1 00:00:00 1970
From: writeonce@midipix.org
Subject: RE: validation of utf-8 strings passed as system call arguments
Date: Fri, 13 Dec 2013 13:23:14 -0700
Message-ID: <20131213132314.dc30d64f61e5ec441c34ffd4f788e58e.76cc1f0026.wbe@email22.secureserver.net>

On 12/13/2013 02:46 PM, Rich Felker wrote:
> On Fri, Dec 13, 2013 at 11:57:54AM -0700, writeonce@midipix.org
> wrote:
>>> There's no way to convert between UTF-8 and UTF-16 without
>>> parsing/decoding the UTF-8, which includes validating it for free
>>> if your parser is written properly. Failure to validate would lead
>>> to all sorts of bugs, many of them dangerous, including things like
>>> treating strings not containing '/', '\', ':', '.', etc. as if they
>>> contained those characters, resulting in directory escape
>>> vulnerabilities.
>>
>> Absolutely, and this is something that I am checking anyway. But
>> there is also the special case where an ill-formed utf-8 byte
>> sequence can still result in a valid code point, which can then be
>> safely converted to utf-16. These cases, which are generally known
>> as the problem of the "non shortest form," pertain to byte
>> sequences that used to be valid before Unicode version 3.1, but are
>> now forbidden, hence table 3-7 of the current (6.2) standard.
>
> What I was saying is that you don't have this problem if you're
> parsing/decoding UTF-8 correctly. And parsing it correctly is not
> harder/slower than doing it the way that results in misinterpreting
> illegal sequences as "non shortest form" for other characters. A
> good treatment of the subject (and near-optimal implementation) is
> here:
>
> http://bjoern.hoehrmann.de/utf-8/decoder/dfa/
>
> My implementation in musl is based on the same ideas (UTF-8 decoding
> as a state machine rather than complex conditionals) but I reduced
> the size of the state from two ints to just one and reduced the size
> of the state table significantly by essentially encoding the
> transitions and partial character values into the state values.

Thanks for the tips and reference. Once everything else is working I'll
certainly switch to a method that follows either yours or Hoehrmann's
optimization (which I'll admittedly need more than a few minutes to
understand...) For the time being I am leaving the set of conditionals
that follows the standard and table 3-7, as that is very easy to
implement. And with the target strings being relatively short,
hopefully this won't even bear any real performance consequences.

> If you're making UTF-8 to UTF-16 conversions to feed to the Windows
> kernel filesystem code, I'd do them at the last possible opportunity
> before passing the strings to the kernel, and just generate a fake
> error equivalent to "file does not exist" or "invalid filename" if
> the conversion encounters any illegal sequences.

Indeed, that is exactly how I am doing this.
zg

> Rich
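A "set of conditionals that follows table 3-7" could look something like the sketch below. This is not zg's actual code, which does not appear in the thread; the function name utf8_is_well_formed is hypothetical, and the byte ranges are those of Table 3-7 (Well-Formed UTF-8 Byte Sequences) in the Unicode Standard. Only the second byte ever has a restricted range, and every later byte must fall in 80..BF.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical validator: returns 1 if s[0..n) is well-formed UTF-8
 * according to the Table 3-7 byte ranges, 0 otherwise. */
static int utf8_is_well_formed(const unsigned char *s, size_t n)
{
    size_t i = 0;

    assert(s != NULL);
    while (i < n) {
        unsigned char b = s[i];
        unsigned char lo = 0x80, hi = 0xBF; /* range of the second byte */
        size_t len, k;

        if (b <= 0x7F) { i++; continue; }
        else if (b >= 0xC2 && b <= 0xDF) { len = 2; }
        else if (b == 0xE0)              { len = 3; lo = 0xA0; }
        else if (b >= 0xE1 && b <= 0xEC) { len = 3; }
        else if (b == 0xED)              { len = 3; hi = 0x9F; }
        else if (b == 0xEE || b == 0xEF) { len = 3; }
        else if (b == 0xF0)              { len = 4; lo = 0x90; }
        else if (b >= 0xF1 && b <= 0xF3) { len = 4; }
        else if (b == 0xF4)              { len = 4; hi = 0x8F; }
        else return 0;          /* 80..C1 and F5..FF cannot start a sequence */

        if (len > n - i)
            return 0;           /* sequence runs past the end of input */
        if (s[i + 1] < lo || s[i + 1] > hi)
            return 0;           /* over-long, surrogate, or out of range */
        for (k = 2; k < len; k++)
            if (s[i + k] < 0x80 || s[i + k] > 0xBF)
                return 0;
        i += len;
    }
    return 1;
}
```

Under this sketch the sequence E0 9F 80 is rejected because 9F falls below the A0 lower bound that the table gives for the byte following E0, while E0 A0 80 (U+0800) is accepted.
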