Newsgroups: gmane.linux.lib.musl.general
Subject: RE: validation of utf-8 strings passed as system call arguments
Date: Fri, 13 Dec 2013 05:52:35 -0700
Message-ID: <20131213055235.dc30d64f61e5ec441c34ffd4f788e58e.563168f7a8.wbe@email22.secureserver.net>
Reply-To: musl@lists.openwall.com
To: musl@lists.openwall.com

On 12/12/2013 11:39 PM, Rich Felker wrote:

On Thu, Dec 12, 2013 at 09:30:06PM -0700, writeonce@midipix.org wrote:

   Hello,

   While working on code that converts arguments from utf-16 to utf-8, I
   found myself wondering about the "responsibility" for checking
   well-formedness of utf-8 strings that are passed to the kernel.  As I
   suspected, validation of these strings takes place neither in the kernel,
   nor in the C library.  The attached program demonstrates this by creating
   a file named <0xE0 0x9F 0x80>, which according to the Unicode Standard
   (6.2, p. 95) is an ill-formed byte sequence.

   I am not sure whether this can officially be considered a bug, and it is
   quite clear that fixing this is going to entail some performance penalty.
   That being said, after deleting this file from my Ubuntu desktop most (but
   not all) attempts to open the Trash folder made Nautilus crash, and it was
   only after deleting the file permanently from the shell that order had
   been restored...
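
For reference, the sequence <0xE0 0x9F 0x80> is ill-formed because a lead byte of 0xE0 must be followed by a byte in the range 0xA0..0xBF; with 0x9F the bits would decode to U+07C0, which already has the shorter form <0xDF 0x80>. The following is a minimal sketch of a well-formedness check built on the byte ranges in the Unicode Standard's table of well-formed UTF-8 byte sequences (Table 3-7 in recent versions). The function name `utf8_well_formed` is hypothetical and is not taken from the thread or from musl.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if the n bytes at s are well-formed UTF-8, 0 otherwise.
 * Hypothetical helper for illustration only. */
int utf8_well_formed(const unsigned char *s, size_t n)
{
    size_t i = 0;

    assert(s != NULL || n == 0);
    while (i < n) {
        unsigned char b = s[i];
        unsigned char lo = 0x80, hi = 0xBF; /* allowed range for the 2nd byte */
        size_t trail;                       /* number of bytes after the lead */

        if (b <= 0x7F) { i++; continue; }   /* ASCII */
        else if (b >= 0xC2 && b <= 0xDF) trail = 1;
        else if (b == 0xE0) { trail = 2; lo = 0xA0; } /* excludes overlong forms */
        else if (b >= 0xE1 && b <= 0xEC) trail = 2;
        else if (b == 0xED) { trail = 2; hi = 0x9F; } /* excludes surrogates */
        else if (b >= 0xEE && b <= 0xEF) trail = 2;
        else if (b == 0xF0) { trail = 3; lo = 0x90; } /* excludes overlong forms */
        else if (b >= 0xF1 && b <= 0xF3) trail = 3;
        else if (b == 0xF4) { trail = 3; hi = 0x8F; } /* excludes > U+10FFFF */
        else return 0; /* 0x80..0xC1 and 0xF5..0xFF never start a sequence */

        if (n - i <= trail) return 0;                 /* truncated sequence */
        if (s[i + 1] < lo || s[i + 1] > hi) return 0; /* restricted 2nd byte */
        for (size_t k = 2; k <= trail; k++)
            if (s[i + k] < 0x80 || s[i + k] > 0xBF) return 0;
        i += trail + 1;
    }
    return 1;
}
```

Called on the three bytes from the report, this sketch returns 0, while the neighbouring sequence <0xE0 0xA0 0x80> (U+0800) is accepted.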

There's nothing in POSIX that says that filenames have to be valid
strings in the current locale's encoding -- in fact, this is highly
problematic to enforce on implementations other than musl, such as
glibc, where the encoding might vary by locale and where different
users might be using locales with different encodings.

But there's also nothing that says arbitrary byte sequences (excluding
of course those containing '/' and NUL) have to be accepted as
filenames either. The historical _expectation_ and practice has been
that filenames can contain arbitrary byte sequences. And Linus in
particular is opposed to changing this, though there's been some
indication (I don't have references right off) that he might be open
to optional restrictions at the kernel level.

What's clear to me is that restrictions at the libc level are not
useful. If your concern is that creating files with illegal sequences
in their names can confuse/break/crash some software, adding a
restriction on file creation in libc won't help. A malicious user can
just make the syscalls directly to make malicious filenames. On the
other hand, having the restriction in libc would be annoying because
it would _prevent_ you from renaming or deleting these bad filenames
using standard tools; you'd have to use special tools that make the
syscalls directly.

As always, you are absolutely right :-) My situation is slightly
different, though: the input I receive is expected to be in utf-8, but
the NT kernel only accepts utf-16. This means that I need to choose
between a conversion based on bit distribution only, which might
produce ill-formed utf-16 sequences, and doing all the validation on my
end despite the minor performance penalty. Since path strings are
normally only a few hundred bytes long, and given that the NT kernel
cannot be (easily) debugged from my end, I'm leaning towards the latter
option.
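
The trade-off described here can be made concrete. A conversion that only redistributes bits would turn <0xED 0xA0 0x80> into the single code unit 0xD800, an unpaired surrogate and therefore ill-formed utf-16. The sketch below shows the validating alternative under simple assumptions: a caller-supplied output buffer and a hypothetical function name, `utf8_to_utf16_checked`. It is not the poster's actual code and does not model any NT-specific path handling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Convert n bytes of utf-8 at src into utf-16 code units at dst
 * (capacity cap units). Returns the number of code units written,
 * or -1 if the input is ill-formed or dst is too small. */
long utf8_to_utf16_checked(const unsigned char *src, size_t n,
                           uint16_t *dst, size_t cap)
{
    size_t i = 0, o = 0;

    assert(src != NULL && dst != NULL);
    while (i < n) {
        uint32_t cp, min;   /* decoded code point, smallest legal value */
        size_t trail;       /* continuation bytes still expected */
        unsigned char b = src[i++];

        if (b < 0x80)                { cp = b;        trail = 0; min = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; trail = 1; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; trail = 2; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; trail = 3; min = 0x10000; }
        else return -1;     /* stray continuation or invalid lead byte */

        if (n - i < trail) return -1;            /* truncated sequence */
        while (trail--) {
            unsigned char c = src[i++];
            if ((c & 0xC0) != 0x80) return -1;   /* not a continuation byte */
            cp = (cp << 6) | (c & 0x3F);
        }

        /* The checks that a bit-distribution-only conversion skips. */
        if (cp < min) return -1;                      /* overlong encoding */
        if (cp >= 0xD800 && cp <= 0xDFFF) return -1;  /* encoded surrogate */
        if (cp > 0x10FFFF) return -1;                 /* beyond Unicode */

        if (cp < 0x10000) {
            if (o + 1 > cap) return -1;
            dst[o++] = (uint16_t)cp;
        } else {
            if (o + 2 > cap) return -1;
            cp -= 0x10000;
            dst[o++] = (uint16_t)(0xD800 | (cp >> 10));   /* high surrogate */
            dst[o++] = (uint16_t)(0xDC00 | (cp & 0x3FF)); /* low surrogate */
        }
    }
    return (long)o;
}
```

The extra cost over the unchecked variant is three comparisons per decoded character, which fits the expectation of a minor penalty for path strings of a few hundred bytes.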

So if you want protection against illegal sequences in filenames
(personally, I want this too) the right place to lobby for it (and
propose an optional feature) is in the kernel, not in libc.

Rich