On Fri, Dec 13, 2013 at 05:52:35AM -0700, writeonce@midipix.org wrote:

As always, you are absolutely right:-) but my situation is slightly different, though; the input I receive is expected to be in utf-8, but the nt kernel only accepts utf-16. This means that I need to choose between conversion that is based on bit distribution only, which might produce ill-formed utf-16 byte sequences, or do all the validation on my end despite the minor performance penalty. Since path strings are normally only a few hundred bytes long, and given that the nt kernel cannot be (easily) debugged from my end, I'm leaning towards the latter option.

There's no way to convert between UTF-8 and UTF-16 without parsing/decoding the UTF-8, which includes validating it for free if your parser is written properly. Failure to validate would lead to all sorts of bugs, many of them dangerous, including things like treating strings not containing '/', '\', ':', '.', etc. as if they contained those characters, resulting in directory escape vulnerabilities.

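To make the "validation comes for free" point concrete, here is a rough sketch of a decoder that rejects ill-formed input while it decodes, followed by the UTF-16 encoding step. The function names (utf8_decode, utf8_to_utf16) and the error convention are made up for illustration; this is not code from either of our projects, only one way such a converter could look.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one UTF-8 sequence from s (n bytes available).
 * Returns the number of bytes consumed, or 0 if the input is ill-formed. */
static size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    size_t len, i;
    uint32_t c, min;
    unsigned char b0;

    if (n == 0) return 0;
    b0 = s[0];
    if (b0 < 0x80) { *cp = b0; return 1; }

    /* Lead byte selects the length and the smallest code point
     * that legitimately needs that length. */
    if (b0 >= 0xC2 && b0 <= 0xDF)      { len = 2; c = b0 & 0x1F; min = 0x80; }
    else if ((b0 & 0xF0) == 0xE0)      { len = 3; c = b0 & 0x0F; min = 0x800; }
    else if (b0 >= 0xF0 && b0 <= 0xF4) { len = 4; c = b0 & 0x07; min = 0x10000; }
    else return 0;                     /* stray continuation, C0/C1, F5..FF */

    if (n < len) return 0;             /* truncated sequence */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;   /* not a continuation byte */
        c = c << 6 | (s[i] & 0x3F);
    }
    if (c < min) return 0;                     /* non-shortest form */
    if (c >= 0xD800 && c <= 0xDFFF) return 0;  /* surrogate code point */
    if (c > 0x10FFFF) return 0;                /* beyond Unicode range */

    *cp = c;
    return len;
}

/* Hypothetical converter: returns the number of UTF-16 code units written,
 * or -1 if the input is ill-formed or the output buffer is too small. */
static long utf8_to_utf16(const unsigned char *in, size_t n,
                          uint16_t *out, size_t cap)
{
    size_t o = 0;
    while (n) {
        uint32_t cp;
        size_t k = utf8_decode(in, n, &cp);
        if (!k) return -1;
        in += k;
        n -= k;
        if (cp < 0x10000) {
            if (o + 1 > cap) return -1;
            out[o++] = (uint16_t)cp;
        } else {
            /* Encode as a surrogate pair. */
            if (o + 2 > cap) return -1;
            cp -= 0x10000;
            out[o++] = (uint16_t)(0xD800 | (cp >> 10));
            out[o++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        }
    }
    return (long)o;
}
```

The checks after the loop cost a handful of comparisons per character, which is negligible for path strings of a few hundred bytes.
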
Absolutely, and this is something that I am checking anyway. But
there is also the special case where an ill-formed utf-8 byte sequence
can still result in a valid code point, which can then be safely
converted to utf-16. These cases, which are generally known as the
problem of the "non shortest form," pertain to byte sequences that
used to be valid before Unicode version 3.1, but are now forbidden,
hence table 3-7 of the current (6.2) standard.
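
For illustration only, a small sketch of the non-shortest-form case for two-byte sequences. The helper names (naive_decode2, is_shortest2) are hypothetical. The first function distributes bits without any range check; the second applies the shortest-form rule, under which a two-byte sequence must encode U+0080 or above.

```c
#include <stdint.h>

/* Hypothetical helper: bit-distribution-only decode of a two-byte
 * sequence (110xxxxx 10yyyyyy), with no validation at all. */
static uint32_t naive_decode2(unsigned char b0, unsigned char b1)
{
    return (uint32_t)(b0 & 0x1F) << 6 | (uint32_t)(b1 & 0x3F);
}

/* Hypothetical helper: a two-byte sequence is in shortest form only if
 * the decoded value could not have been encoded in a single byte. */
static int is_shortest2(unsigned char b0, unsigned char b1)
{
    return naive_decode2(b0, b1) >= 0x80;
}
```

With these, the bytes C0 AF decode naively to 0x2F, which is '/', even though no '/' byte appears in the input, and is_shortest2 rejects the pair. The bytes C3 A9 decode to U+00E9 and pass.
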
zg
Rich