mailing list of musl libc
From: Rich Felker <dalias@aerifal.cx>
To: musl@lists.openwall.com
Subject: Re: validation of utf-8 strings passed as system call arguments
Date: Fri, 13 Dec 2013 14:46:40 -0500
Message-ID: <20131213194640.GE24286@brightrain.aerifal.cx>
In-Reply-To: <20131213115754.dc30d64f61e5ec441c34ffd4f788e58e.3729304b11.wbe@email22.secureserver.net>

On Fri, Dec 13, 2013 at 11:57:54AM -0700, writeonce@midipix.org wrote:
> > There's no way to convert between UTF-8 and UTF-16 without
> > parsing/decoding the UTF-8, which includes validating it for free if
> > your parser is written properly. Failure to validate would lead to all
> > sorts of bugs, many of them dangerous, including things like treating
> > strings not containing '/', '\', ':', '.', etc. as if they contained
> > those characters, resulting in directory escape vulnerabilities.
>
> Absolutely, and this is something that I am checking anyway.  But there is
> also the special case where an ill-formed utf-8 byte sequence can still
> result in a valid code point, which can then be safely converted to
> utf-16.  These cases, which are generally known as the problem of the "non
> shortest form," pertain to byte sequences that used to be valid before
> Unicode version 3.1, but are now forbidden, hence table 3-7 of the current
> (6.2) standard.

What I was saying is that you don't have this problem if you're
parsing/decoding UTF-8 correctly. And parsing it correctly is not
harder/slower than doing it the way that results in misinterpreting
illegal sequences as "non shortest form" for other characters. A good
treatment of the subject (and near-optimal implementation) is here:

http://bjoern.hoehrmann.de/utf-8/decoder/dfa/

My implementation in musl is based on the same ideas (UTF-8 decoding
as a state machine rather than complex conditionals) but I reduced the
size of the state from two ints to just one and reduced the size of
the state table significantly by essentially encoding the transitions
and partial character values into the state values.
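
To make the point about validation coming for free concrete, here is a
minimal sketch of a strict decoder. It is neither the musl code nor the
DFA from the page linked above; it is a plain range-check version, and
the name utf8_decode_strict is made up for illustration. The only
per-sequence work beyond ordinary decoding is one range check on the
second byte, which is what rules out non-shortest forms, surrogates, and
values above U+10FFFF.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative helper (hypothetical name, not musl's implementation).
 * Decodes one code point from s[0..n). Returns the number of bytes
 * consumed (1..4), or 0 if the sequence is ill-formed or truncated. */
static size_t utf8_decode_strict(const unsigned char *s, size_t n, uint32_t *cp)
{
    if (n == 0) return 0;
    unsigned char b = s[0];
    if (b < 0x80) { *cp = b; return 1; }

    /* 0x80..0xBF are stray continuation bytes, 0xC0/0xC1 can only start
     * non-shortest 2-byte forms, and 0xF5..0xFF exceed U+10FFFF. */
    if (b < 0xC2 || b > 0xF4) return 0;

    size_t len = b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
    if (n < len) return 0;

    /* Allowed range of the second byte depends on the lead byte. */
    unsigned char lo = 0x80, hi = 0xBF;
    if (b == 0xE0) lo = 0xA0;        /* rejects non-shortest 3-byte forms */
    else if (b == 0xED) hi = 0x9F;   /* rejects surrogates U+D800..U+DFFF */
    else if (b == 0xF0) lo = 0x90;   /* rejects non-shortest 4-byte forms */
    else if (b == 0xF4) hi = 0x8F;   /* rejects values above U+10FFFF */
    if (s[1] < lo || s[1] > hi) return 0;

    /* Payload bits of the lead byte: 5, 4 or 3 for len 2, 3 or 4. */
    uint32_t c = b & (0xFFu >> (len + 1));
    c = (c << 6) | (s[1] & 0x3Fu);
    for (size_t i = 2; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;
        c = (c << 6) | (s[i] & 0x3Fu);
    }
    *cp = c;
    return len;
}
```

With this, the two-byte sequence C0 AF (a non-shortest encoding of '/')
is rejected outright rather than being decoded to U+002F.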

If you're making UTF-8 to UTF-16 conversions to feed to the Windows
kernel filesystem code, I'd do them at the last possible opportunity
before passing the strings to the kernel, and just generate a fake
error equivalent to "file does not exist" or "invalid filename" if the
conversion encounters any illegal sequences.
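
A rough sketch of that last-moment conversion might look like the
following. Everything here is an assumption made for illustration: the
function names, the fixed-size output buffer, and the choice of ENOENT
as the stand-in for "file does not exist" (ENAMETOOLONG is used when the
result does not fit). The decoder is a simple range-check one, included
only so the example is self-contained.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Strict single-code-point decoder (illustrative). Returns bytes
 * consumed, or 0 on any ill-formed or truncated sequence. */
static size_t decode_one(const unsigned char *s, size_t n, uint32_t *cp)
{
    if (n == 0) return 0;
    unsigned char b = s[0];
    if (b < 0x80) { *cp = b; return 1; }
    if (b < 0xC2 || b > 0xF4) return 0;
    size_t len = b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
    if (n < len) return 0;
    unsigned char lo = 0x80, hi = 0xBF;
    if (b == 0xE0) lo = 0xA0;
    else if (b == 0xED) hi = 0x9F;
    else if (b == 0xF0) lo = 0x90;
    else if (b == 0xF4) hi = 0x8F;
    if (s[1] < lo || s[1] > hi) return 0;
    uint32_t c = b & (0xFFu >> (len + 1));
    c = (c << 6) | (s[1] & 0x3Fu);
    for (size_t i = 2; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;
        c = (c << 6) | (s[i] & 0x3Fu);
    }
    *cp = c;
    return len;
}

/* Hypothetical wrapper: convert a NUL-terminated UTF-8 path into
 * NUL-terminated UTF-16 in out[0..cap). Returns the number of UTF-16
 * units written (terminator excluded), or -1 with errno set. */
static long utf8_path_to_utf16(const char *path, uint16_t *out, size_t cap)
{
    const unsigned char *s = (const unsigned char *)path;
    size_t n = strlen(path), o = 0;

    if (cap == 0) { errno = ENAMETOOLONG; return -1; }
    while (n) {
        uint32_t cp;
        size_t k = decode_one(s, n, &cp);
        /* Illegal sequence: behave as if the file did not exist. */
        if (k == 0) { errno = ENOENT; return -1; }

        size_t need = cp >= 0x10000 ? 2 : 1;
        /* Always keep one unit free for the terminator. */
        if (o + need >= cap) { errno = ENAMETOOLONG; return -1; }

        if (cp >= 0x10000) {
            cp -= 0x10000;                       /* surrogate pair */
            out[o++] = (uint16_t)(0xD800 | (cp >> 10));
            out[o++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        } else {
            out[o++] = (uint16_t)cp;
        }
        s += k;
        n -= k;
    }
    out[o] = 0;
    return (long)o;
}
```

A caller would run this immediately before handing the buffer to the
kernel interface and simply propagate the -1 on failure, so no
ill-formed name ever reaches the filesystem code.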

Rich