RE: validation of utf-8 strings passed as system call arguments

mailing list of musl libc

* RE: validation of utf-8 strings passed as system call arguments
@ 2013-12-13 20:23 writeonce
  0 siblings, 0 replies; 10+ messages in thread
From: writeonce @ 2013-12-13 20:23 UTC (permalink / raw)
  To: musl

On 12/13/2013 02:46 PM, Rich Felker wrote:
> On Fri, Dec 13, 2013 at 11:57:54AM -0700, writeonce@midipix.org
> wrote:
>> There's no way to convert between UTF-8 and UTF-16 without 
>> parsing/decoding the UTF-8, which includes validating it for free
>> if your parser is written properly. Failure to validate would lead
>> to all sorts of bugs, many of them dangerous, including things like
>> treating strings not containing '/', '\', ':', '.', etc. as if they
>> contained those characters, resulting in directory escape
>> vulnerabilities.
>> 
>> Absolutely, and this is something that I am checking anyway.  But
>> there is also the special case where an ill-formed utf-8 byte
>> sequence can still result in a valid code point, which can then be
>> safely converted to utf-16.  These cases, which are generally known
>> as the problem of the "non shortest form," pertain to byte
>> sequences that used to be valid before Unicode version 3.1, but are
>> now forbidden, hence table 3-7 of the current (6.2) standard.
> 
> What I was saying is that you don't have this problem if you're 
> parsing/decoding UTF-8 correctly. And parsing it correctly is not 
> harder/slower than doing it the way that results in misinterpreting 
> illegal sequences as "non shortest form" for other characters. A
> good treatment of the subject (and near-optimal implementation) is
> here:
> 
> http://bjoern.hoehrmann.de/utf-8/decoder/dfa/
> 
> My implementation in musl is based on the same ideas (UTF-8 decoding 
> as a state machine rather than complex conditionals) but I reduced
> the size of the state from two ints to just one and reduced the size
> of the state table significantly by essentially encoding the
> transitions and partial character values into the state values.

Thanks for the tips and reference.  Once everything else is working I'll
certainly switch to a method that follows either your optimization or
Hoehrmann's (which I'll admittedly need more than a few minutes to
understand...).  For the time being I am leaving in place the set of
conditionals that follows the standard and table 3-7, as that is very
easy to implement.  And with the target strings being relatively short,
hopefully this won't have any real performance consequences.

> 
> If you're making UTF-8 to UTF-16 conversions to feed to the Windows 
> kernel filesystem code, I'd do them at the last possible opportunity 
> before passing the strings to the kernel, and just generate a fake 
> error equivalent to "file does not exist" or "invalid filename" if
> the conversion encounters any illegal sequences.

Indeed, that is exactly how I am doing this.
zg

> 
> Rich
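
For illustration, the "set of conditionals that follows the standard and
table 3-7" mentioned above might look roughly like the sketch below.  It is
not the code from this thread; the names utf8_is_well_formed and in_range
are made up for the example, and the byte ranges are those of Table 3-7
(Well-Formed UTF-8 Byte Sequences) in the Unicode Standard.

```c
#include <stddef.h>

/* Hypothetical helper: is b within [lo, hi]? */
static int in_range(unsigned char b, unsigned char lo, unsigned char hi)
{
	return b >= lo && b <= hi;
}

/* Returns 1 if the n bytes at s are well-formed UTF-8, 0 otherwise.
 * Each branch corresponds to one row of Table 3-7. */
int utf8_is_well_formed(const unsigned char *s, size_t n)
{
	size_t i = 0;

	while (i < n) {
		unsigned char b0 = s[i];
		unsigned char lo = 0x80, hi = 0xBF; /* bounds for the second byte */
		size_t len, k;

		if (b0 <= 0x7F) { i++; continue; }
		else if (in_range(b0, 0xC2, 0xDF)) len = 2;
		else if (b0 == 0xE0) { len = 3; lo = 0xA0; } /* no overlong forms */
		else if (in_range(b0, 0xE1, 0xEC)) len = 3;
		else if (b0 == 0xED) { len = 3; hi = 0x9F; } /* no surrogates */
		else if (in_range(b0, 0xEE, 0xEF)) len = 3;
		else if (b0 == 0xF0) { len = 4; lo = 0x90; } /* no overlong forms */
		else if (in_range(b0, 0xF1, 0xF3)) len = 4;
		else if (b0 == 0xF4) { len = 4; hi = 0x8F; } /* nothing above U+10FFFF */
		else return 0; /* 80..C1 and F5..FF never start a sequence */

		if (n - i < len) return 0; /* truncated sequence */
		if (!in_range(s[i + 1], lo, hi)) return 0;
		for (k = 2; k < len; k++)
			if (!in_range(s[i + k], 0x80, 0xBF)) return 0;
		i += len;
	}
	return 1;
}
```

With this sketch, the sequence <E0 9F 80> from the original test program is
rejected at the second byte, since 0x9F falls below the 0xA0 lower bound
that applies after 0xE0.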





* RE: validation of utf-8 strings passed as system call arguments
@ 2013-12-13 18:57 writeonce
  2013-12-13 19:46 ` Rich Felker
  0 siblings, 1 reply; 10+ messages in thread
From: writeonce @ 2013-12-13 18:57 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/html, Size: 2181 bytes --]


* Re: validation of utf-8 strings passed as system call arguments
  2013-12-13 18:57 writeonce
@ 2013-12-13 19:46 ` Rich Felker
  0 siblings, 0 replies; 10+ messages in thread
From: Rich Felker @ 2013-12-13 19:46 UTC (permalink / raw)
  To: musl

On Fri, Dec 13, 2013 at 11:57:54AM -0700, writeonce@midipix.org wrote:
>  There's no way to convert between UTF-8 and UTF-16 without
>  parsing/decoding the UTF-8, which includes validating it for free if
>  your parser is written properly. Failure to validate would lead to all
>  sorts of bugs, many of them dangerous, including things like treating
>  strings not containing '/', '\', ':', '.', etc. as if they contained
>  those characters, resulting in directory escape vulnerabilities.
> 
>    Absolutely, and this is something that I am checking anyway.  But there is
>    also the special case where an ill-formed utf-8 byte sequence can still
>    result in a valid code point, which can then be safely converted to
>    utf-16.  These cases, which are generally known as the problem of the "non
>    shortest form," pertain to byte sequences that used to be valid before
>    Unicode version 3.1, but are now forbidden, hence table 3-7 of the current
>    (6.2) standard.

What I was saying is that you don't have this problem if you're
parsing/decoding UTF-8 correctly. And parsing it correctly is not
harder/slower than doing it the way that results in misinterpreting
illegal sequences as "non shortest form" for other characters. A good
treatment of the subject (and near-optimal implementation) is here:

http://bjoern.hoehrmann.de/utf-8/decoder/dfa/

My implementation in musl is based on the same ideas (UTF-8 decoding
as a state machine rather than complex conditionals) but I reduced the
size of the state from two ints to just one and reduced the size of
the state table significantly by essentially encoding the transitions
and partial character values into the state values.

If you're making UTF-8 to UTF-16 conversions to feed to the Windows
kernel filesystem code, I'd do them at the last possible opportunity
before passing the strings to the kernel, and just generate a fake
error equivalent to "file does not exist" or "invalid filename" if the
conversion encounters any illegal sequences.

Rich
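
As a rough sketch of the two suggestions above (decode with a small explicit
state instead of nested special cases, and turn any illegal sequence into a
fake "file does not exist" error at conversion time), something like the
following could work.  This is neither Hoehrmann's table-driven DFA nor
musl's decoder; it tracks the number of continuation bytes still needed and
the allowed range of the next byte.  The function name path_utf8_to_utf16
and the choice of ENOENT and ENAMETOOLONG are assumptions made for the
example.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Converts NUL-terminated UTF-8 in src to NUL-terminated UTF-16 in dst,
 * which has room for cap code units.  Returns the number of code units
 * written (not counting the terminator), or -1 with errno set. */
ptrdiff_t path_utf8_to_utf16(const char *src, uint16_t *dst, size_t cap)
{
	const unsigned char *s = (const unsigned char *)src;
	uint32_t cp = 0;                     /* partial code point value */
	unsigned need = 0;                   /* continuation bytes still expected */
	unsigned char lo = 0x80, hi = 0xBF;  /* allowed range of the next byte */
	size_t out = 0;

	for (; *s; s++) {
		unsigned char b = *s;
		size_t units;

		if (need == 0) {
			if (b <= 0x7F) {
				cp = b;
			} else if (b >= 0xC2 && b <= 0xDF) {
				need = 1; cp = b & 0x1F;
			} else if (b >= 0xE0 && b <= 0xEF) {
				need = 2; cp = b & 0x0F;
				if (b == 0xE0) lo = 0xA0; /* rejects overlong forms */
				if (b == 0xED) hi = 0x9F; /* rejects surrogates */
			} else if (b >= 0xF0 && b <= 0xF4) {
				need = 3; cp = b & 0x07;
				if (b == 0xF0) lo = 0x90; /* rejects overlong forms */
				if (b == 0xF4) hi = 0x8F; /* rejects > U+10FFFF */
			} else {
				goto ill_formed;
			}
			if (need) continue;
		} else {
			if (b < lo || b > hi) goto ill_formed;
			lo = 0x80; hi = 0xBF;
			cp = (cp << 6) | (b & 0x3F);
			if (--need) continue;
		}

		/* cp now holds a complete Unicode scalar value. */
		units = cp >= 0x10000 ? 2 : 1;
		if (cap < out + units + 1) goto too_long;
		if (units == 2) {
			cp -= 0x10000;
			dst[out++] = (uint16_t)(0xD800 + (cp >> 10));
			dst[out++] = (uint16_t)(0xDC00 + (cp & 0x3FF));
		} else {
			dst[out++] = (uint16_t)cp;
		}
	}
	if (need) goto ill_formed;           /* input ended mid-sequence */
	if (out >= cap) goto too_long;
	dst[out] = 0;
	return (ptrdiff_t)out;

ill_formed:
	errno = ENOENT;                      /* fake "file does not exist" */
	return -1;
too_long:
	errno = ENAMETOOLONG;
	return -1;
}
```

A caller would run this immediately before handing the name to the kernel
and return the error unchanged, so an ill-formed name never reaches the
filesystem code.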


* RE: validation of utf-8 strings passed as system call arguments
@ 2013-12-13 12:52 writeonce
  2013-12-13 17:28 ` Rich Felker
  0 siblings, 1 reply; 10+ messages in thread
From: writeonce @ 2013-12-13 12:52 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/html, Size: 3774 bytes --]


* Re: validation of utf-8 strings passed as system call arguments
  2013-12-13 12:52 writeonce
@ 2013-12-13 17:28 ` Rich Felker
  0 siblings, 0 replies; 10+ messages in thread
From: Rich Felker @ 2013-12-13 17:28 UTC (permalink / raw)
  To: musl

On Fri, Dec 13, 2013 at 05:52:35AM -0700, writeonce@midipix.org wrote:
>    As always, you are absolutely right:-)  but my situation is slightly
>    different, though; the input I receive is expected to be in utf-8, but the
>    nt kernel only accepts utf-16.  This means that I need to choose between
>    conversion that is based on bit distribution only, which might  produce
>    ill-formed utf-16 byte sequences, or do all the validation on my end
>    despite the minor performance penalty.  Since path strings are normally
>    only a few hundred bytes long, and given that the nt kernel cannot be
>    (easily) debugged from my end, I'm leaning towards the latter option.

There's no way to convert between UTF-8 and UTF-16 without
parsing/decoding the UTF-8, which includes validating it for free if
your parser is written properly. Failure to validate would lead to all
sorts of bugs, many of them dangerous, including things like treating
strings not containing '/', '\', ':', '.', etc. as if they contained
those characters, resulting in directory escape vulnerabilities.

Rich
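
To make the danger concrete, here is a minimal sketch of what a conversion
"based on bit distribution only" does with an overlong two-byte sequence.
The function names naive_decode2 and strict_decode2 are hypothetical and
exist only for this example.

```c
#include <stdint.h>

/* Naive two-byte decode: extracts the payload bits and checks nothing. */
uint32_t naive_decode2(unsigned char b0, unsigned char b1)
{
	return ((uint32_t)(b0 & 0x1F) << 6) | (b1 & 0x3F);
}

/* Strict two-byte decode: accepts only C2..DF followed by 80..BF.
 * Returns 1 and stores the code point in *cp on success, 0 otherwise. */
int strict_decode2(unsigned char b0, unsigned char b1, uint32_t *cp)
{
	if (b0 < 0xC2 || b0 > 0xDF || b1 < 0x80 || b1 > 0xBF)
		return 0;
	*cp = ((uint32_t)(b0 & 0x1F) << 6) | (b1 & 0x3F);
	return 1;
}
```

Here naive_decode2(0xC0, 0xAF) yields 0x2F ('/') and
naive_decode2(0xC1, 0x9C) yields 0x5C ('\'), although neither input
contains those bytes, so any check for separators done on the raw bytes is
bypassed.  strict_decode2 rejects both inputs.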



* validation of utf-8 strings passed as system call arguments
@ 2013-12-13  4:30 writeonce
  2013-12-13  4:39 ` Rich Felker
  2013-12-13 12:11 ` Luca Barbato
  0 siblings, 2 replies; 10+ messages in thread
From: writeonce @ 2013-12-13  4:30 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/html, Size: 1039 bytes --]

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: open__ill_formed_utf8.c --]
[-- Type: text/x-c; name="open__ill_formed_utf8.c";, Size: 455 bytes --]

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/types.h>

int main (int argc, char * argv[], char * envp[])
{
	/* <E0 9F 80> is ill-formed UTF-8: after 0xE0 the second byte must be 0xA0..0xBF */
	char path[] = "\xE0\x9F\x80";
	mode_t mode = S_IRUSR | S_IWUSR | S_IRGRP | S_IWGRP | S_IROTH;

	int fd = open (path, O_WRONLY | O_EXCL | O_CREAT, mode);
	
	if (fd == -1) {
		perror ("open");
		return 2;
	} else {
		printf("It worked! The file descriptor is %d.\n",fd);
	}
	
	return 0;
}



* Re: validation of utf-8 strings passed as system call arguments
  2013-12-13  4:30 writeonce
@ 2013-12-13  4:39 ` Rich Felker
  2013-12-13  6:36   ` Szabolcs Nagy
  2013-12-13 12:11 ` Luca Barbato
  1 sibling, 1 reply; 10+ messages in thread
From: Rich Felker @ 2013-12-13  4:39 UTC (permalink / raw)
  To: musl

On Thu, Dec 12, 2013 at 09:30:06PM -0700, writeonce@midipix.org wrote:
>    Hello,
> 
>    While working on code that converts arguments from utf-16 to utf-8, I
>    found myself wondering about the "responsibility" for checking
>    well-formedness of utf-8 strings that are passed to the kernel.  As I
>    suspected, validation of these strings takes place neither in the kernel,
>    nor in the C library.  The attached program demonstrates this by creating
>    a file named <0xE0 0x9F 0x80>, which according to the Unicode Standard
>    (6.2, p. 95) is an ill-formed byte sequence.
> 
>    I am not sure whether this can officially be considered a bug, and it is
>    quite clear that fixing this is going to entail some performance penalty. 
>    That being said, after deleting this file from my Ubuntu desktop most (but
>    not all) attempts to open the Trash folder made Nautilus crash, and it was
>    only after deleting the file permanently from the shell that order had
>    been restored...

There's nothing in POSIX that says that filenames have to be valid
strings in the current locale's encoding -- in fact, this is highly
problematic to enforce on implementations other than musl, such as
glibc, where the encoding might vary by locale and where different
users might be using locales with different encodings.

But there's also nothing that says arbitrary byte sequences (excluding
of course those containing '/' and NUL) have to be accepted as
filenames either. The historical _expectation_ and practice has been
that filenames can contain arbitrary byte sequences. And Linus in
particular is opposed to changing this, though there's been some
indication (I don't have references right off) that he might be open
to optional restrictions at the kernel level.

What's clear to me is that restrictions at the libc level are not
useful. If your concern is that creating files with illegal sequences
in their names can confuse/break/crash some software, adding a
restriction on file creation in libc won't help. A malicious user can
just make the syscalls directly to make malicious filenames. On the
other hand, having the restriction in libc would be annoying because
it would _prevent_ you from renaming or deleting these bad filenames
using standard tools; you'd have to use special tools that make the
syscalls directly.

So if you want protection against illegal sequences in filenames
(personally, I want this too) the right place to lobby for it (and
propose an optional feature) is in the kernel, not in libc.

Rich


* Re: validation of utf-8 strings passed as system call arguments
  2013-12-13  4:39 ` Rich Felker
@ 2013-12-13  6:36   ` Szabolcs Nagy
  2013-12-13  6:49     ` Rich Felker
  0 siblings, 1 reply; 10+ messages in thread
From: Szabolcs Nagy @ 2013-12-13  6:36 UTC (permalink / raw)
  To: musl

* Rich Felker <dalias@aerifal.cx> [2013-12-12 23:39:41 -0500]:
> that filenames can contain arbitrary byte sequences. And Linus in
> particular is opposed to changing this, though there's been some
> indication (I don't have references right off) that he might be open
> to optional restrictions at the kernel level.

he didnt look very persuadable some time ago
http://yarchive.net/comp/linux/utf8.html

(i actually like the kernel that way: what would you do when
mounting a filesystem with invalid filenames? would you also
reject surrogate pairs, pua codes or do unicode normalization?)



* Re: validation of utf-8 strings passed as system call arguments
  2013-12-13  6:36   ` Szabolcs Nagy
@ 2013-12-13  6:49     ` Rich Felker
  0 siblings, 0 replies; 10+ messages in thread
From: Rich Felker @ 2013-12-13  6:49 UTC (permalink / raw)
  To: musl

On Fri, Dec 13, 2013 at 07:36:51AM +0100, Szabolcs Nagy wrote:
> * Rich Felker <dalias@aerifal.cx> [2013-12-12 23:39:41 -0500]:
> > that filenames can contain arbitrary byte sequences. And Linus in
> > particular is opposed to changing this, though there's been some
> > indication (I don't have references right off) that he might be open
> > to optional restrictions at the kernel level.
> 
> he didnt look very persuadable some time ago
> http://yarchive.net/comp/linux/utf8.html

Yes, that was a long time ago though. I forget where I saw an
indication that this could change (perhaps the Austin Group list? in
the thread about newlines...) but the general idea, if I recall, was
that restrictions would take place in the framework of a generic layer
for restricting malicious content in filenames that's not UTF-8
specific.

> (i actually like the kernel that way: what would you do when
> mounting a filesystem with invalid filenames? would you also
> reject surrogate pairs, pua codes or do unicode normalization?)

"Surrogate pairs" aren't even a question; surrogates aren't encodable
at all in UTF-8. So they would automatically be gone just by mandating
well-formed UTF-8.

Normalization (which Apple does) is absolutely wrong and
non-conforming to POSIX; it causes multiple distinct names to refer to
the same file (despite having a link count of 1, BTW), which is just
as dangerous as issues like "over-long sequence" decoding and
URL-escaped dots and slashes. The only "correct" way to do
normalization at the FS level is disallowing non-normalized filenames.
But normalization is actually just broken and harmful anyway, since
there are languages for which bugs in Unicode have made the normalized
form contrary to the actual semantic ordering of characters in the
language (characters were incorrectly assigned combining classes such
that letters reorder contrary to their actual semantic order, and due
to stability policy this can't be fixed, so the only solution is to
forget about using normalization).

As for PUA, it wouldn't be forbidden by enforcing UTF-8. Per the
definition, a "UTF" is a bijective mapping between the Unicode scalar
values (0 through 0xD7FF and 0xE000 through 0x10FFFF) and legal
sequences of code units. Whether a character identity is assigned to a
scalar value is irrelevant to UTFs.

Rich
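
The point about scalar values can be sketched in a few lines.  The encoder
below is illustrative only, and the names is_scalar_value and utf8_encode
are made up for the example.  It refuses surrogates and values above
0x10FFFF, while private-use code points such as U+E000 encode like any
other scalar value.

```c
#include <stddef.h>
#include <stdint.h>

/* Unicode scalar values: 0..0xD7FF and 0xE000..0x10FFFF. */
int is_scalar_value(uint32_t cp)
{
	return cp <= 0xD7FF || (cp >= 0xE000 && cp <= 0x10FFFF);
}

/* Encodes cp as UTF-8 into out.  Returns the number of bytes written
 * (1..4), or 0 if cp is not a scalar value and so has no UTF-8 form. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
	if (!is_scalar_value(cp))
		return 0;
	if (cp < 0x80) {
		out[0] = (unsigned char)cp;
		return 1;
	}
	if (cp < 0x800) {
		out[0] = (unsigned char)(0xC0 | (cp >> 6));
		out[1] = (unsigned char)(0x80 | (cp & 0x3F));
		return 2;
	}
	if (cp < 0x10000) {
		out[0] = (unsigned char)(0xE0 | (cp >> 12));
		out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		out[2] = (unsigned char)(0x80 | (cp & 0x3F));
		return 3;
	}
	out[0] = (unsigned char)(0xF0 | (cp >> 18));
	out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
	out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
	out[3] = (unsigned char)(0x80 | (cp & 0x3F));
	return 4;
}
```

Because every scalar value maps to exactly one byte sequence and surrogates
map to none, a decoder that accepts only the output of such an encoder has
nothing left to decide about surrogate pairs.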


* Re: validation of utf-8 strings passed as system call arguments
  2013-12-13  4:30 writeonce
  2013-12-13  4:39 ` Rich Felker
@ 2013-12-13 12:11 ` Luca Barbato
  1 sibling, 0 replies; 10+ messages in thread
From: Luca Barbato @ 2013-12-13 12:11 UTC (permalink / raw)
  To: musl

On 13/12/13 05:30, writeonce@midipix.org wrote:
> Hello,
> 
> While working on code that converts arguments from utf-16 to utf-8, I found 
> myself wondering about the "responsibility" for checking well-formedness of 
> utf-8 strings that are passed to the kernel.  As I suspected, validation of 
> these strings takes place neither in the kernel, nor in the C library.  The 
> attached program demonstrates this by creating a file named <0xE0 0x9F 0x80>, 
> which according to the Unicode Standard (6.2, p. 95) is an ill-formed byte sequence.
> 
> I am not sure whether this can officially be considered a bug, and it is quite 
> clear that fixing this is going to entail some performance penalty.  That being 
> said, after deleting this file from my Ubuntu desktop most (but not all) 
> attempts to open the Trash folder made Nautilus crash, and it was only after 
> deleting the file permanently from the shell that order had been restored...
> 

Any kind of rejection beyond NUL and the separator seems to me more
harmful, and even more dangerous, than the status quo.

lu



end of thread, other threads:[~2013-12-13 20:23 UTC | newest]

