On 12/13/2013 12:28 PM, Rich Felker wrote:

On Fri, Dec 13, 2013 at 05:52:35AM -0700, writeonce@midipix.org wrote:

   As always, you are absolutely right :-)  My situation is slightly
   different, though: the input I receive is expected to be in utf-8, but the
   nt kernel only accepts utf-16.  This means that I need to choose between
   a conversion that is based on bit distribution only, which might produce
   ill-formed utf-16 sequences, and doing all the validation on my end
   despite the minor performance penalty.  Since path strings are normally
   only a few hundred bytes long, and given that the nt kernel cannot be
   (easily) debugged from my end, I'm leaning towards the latter option.

There's no way to convert between UTF-8 and UTF-16 without
parsing/decoding the UTF-8, which includes validating it for free if
your parser is written properly. Failure to validate would lead to all
sorts of bugs, many of them dangerous, including things like treating
strings not containing '/', '\', ':', '.', etc. as if they contained
those characters, resulting in directory escape vulnerabilities.
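
A minimal sketch of the point above, assuming nothing about the actual
midipix or musl code: once the UTF-8 is decoded properly, validation is a
handful of comparisons in the same pass.  The names utf8_decode and
utf8_to_utf16 are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one code point from s[0..n).
 * Returns the number of bytes consumed, or 0 if the input is ill-formed. */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    size_t len, i;
    uint32_t c, min;

    assert(s != NULL && cp != NULL);
    if (n == 0) return 0;

    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    else if (s[0] >= 0xC2 && s[0] <= 0xDF) { len = 2; c = s[0] & 0x1F; min = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0)        { len = 3; c = s[0] & 0x0F; min = 0x800; }
    else if (s[0] >= 0xF0 && s[0] <= 0xF4) { len = 4; c = s[0] & 0x07; min = 0x10000; }
    else return 0;               /* 0x80..0xC1 and 0xF5..0xFF never lead */

    if (n < len) return 0;       /* truncated sequence */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;   /* not a continuation byte */
        c = (c << 6) | (s[i] & 0x3F);
    }
    if (c < min) return 0;                     /* non-shortest form */
    if (c >= 0xD800 && c <= 0xDFFF) return 0;  /* surrogate code point */
    if (c > 0x10FFFF) return 0;                /* beyond the Unicode range */

    *cp = c;
    return len;
}

/* Hypothetical converter: writes at most cap UTF-16 code units to dst.
 * Returns the number of units written, or (size_t)-1 on ill-formed input
 * or insufficient space. */
size_t utf8_to_utf16(const unsigned char *src, size_t n,
                     uint16_t *dst, size_t cap)
{
    size_t o = 0;

    assert(src != NULL && dst != NULL);
    while (n > 0) {
        uint32_t cp;
        size_t k = utf8_decode(src, n, &cp);

        if (k == 0) return (size_t)-1;
        if (cp < 0x10000) {
            if (cap - o < 1) return (size_t)-1;
            dst[o++] = (uint16_t)cp;
        } else {
            if (cap - o < 2) return (size_t)-1;
            cp -= 0x10000;                      /* split into a surrogate pair */
            dst[o++] = (uint16_t)(0xD800 | (cp >> 10));
            dst[o++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        }
        src += k;
        n -= k;
    }
    return o;
}
```

With this shape the converter rejects the whole string on the first
ill-formed sequence, so nothing questionable ever reaches the consumer of
the UTF-16 output.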

Absolutely, and this is something that I am checking anyway.  But there is
also the special case where an ill-formed utf-8 byte sequence still decodes
to a valid code point, which can then be converted to utf-16 without error.
These cases, generally known as the "non-shortest form" problem, pertain to
byte sequences that were valid before Unicode version 3.1 but are now
forbidden; they are excluded by table 3-7 of the current (6.2) standard.
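
To make the non-shortest-form case concrete, here is a small sketch (the
function names naive_decode2 and is_shortest_form are hypothetical): the
two bytes C0 AF carry the bits of U+002F, i.e. '/', so a decoder that only
distributes bits accepts them, while a minimum-value check per sequence
length rejects them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bit distribution only: combines the payload bits of a two-byte sequence
 * with no range or shortest-form checks. */
uint32_t naive_decode2(unsigned char b0, unsigned char b1)
{
    return ((uint32_t)(b0 & 0x1F) << 6) | (uint32_t)(b1 & 0x3F);
}

/* A code point decoded from len bytes is in shortest form only if it is
 * at least the smallest value that actually requires len bytes. */
int is_shortest_form(uint32_t cp, size_t len)
{
    static const uint32_t min_cp[] = { 0, 0, 0x80, 0x800, 0x10000 };

    assert(len >= 1 && len <= 4);
    return cp >= min_cp[len];
}
```

Here naive_decode2(0xC0, 0xAF) yields 0x2F, and is_shortest_form(0x2F, 2)
is false because U+002F fits in a single byte.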

zg

Rich