mailing list of musl libc
From: Will Dietz <w@wdtz.org>
To: musl@lists.openwall.com
Subject: Re: iconv UTF-8 <--> CP1255 roundtrip possible bug?
Date: Thu, 14 Jun 2018 14:37:48 -0500
Message-ID: <CAKGWAO8EUJokpuQmX_mzc=VVfcpkEW1_=m4JFnjoy3xD+EHwmQ@mail.gmail.com>
In-Reply-To: <20180603022635.GV1392@brightrain.aerifal.cx>

Nothing yet; I haven't been able to spend more time on this lately, sorry :).

I'll let you know if I do find anything,
and at least I'll be trying the latest changes you've pushed.

(yay all issues I've run into appear fixed :D)

Thanks!

~Will

On Sat, Jun 2, 2018 at 9:26 PM, Rich Felker <dalias@libc.org> wrote:
> On Wed, May 16, 2018 at 08:48:08PM -0500, Will Dietz wrote:
>> On Wed, May 16, 2018 at 6:04 PM, Rich Felker <dalias@libc.org> wrote:
>> > On Wed, May 16, 2018 at 12:22:36PM -0500, Will Dietz wrote:
>> >> I admit to being a bit unsure, but the behavior shown below doesn't
>> >> seem obviously right --LMK if I'm missing something :).
>> >>
>> >> Input file attached for inspection without relying on it getting
>> >> through byte-identical to what I have--
>> >> indeed I'm not sure copy+paste into this is working correctly (the
>> >> characters look different in my terminal :)).  Anyway:
>> >>
>> >> $ cat cp1255-snippet.xxd
>> >> 00000000: efac b3d6 b8d7 9d0a                      ........
>> >> $ xxd -r cp1255-snippet.xxd
>> >> דָּם
>> >>
>> >> Attempt to round-trip this from UTF-8 to CP1255 and back,
>> >> first with glibc's iconv (2.26):
>> >>
>> >> $ xxd -r cp1255-snippet.xxd|iconv -f UTF-8 -t CP1255|iconv -f CP1255 -t UTF-8 | xxd
>> >> 00000000: efac b3d6 b8d7 9d0a
>> >>
>> >> Looks good, same as what was sent in.
>> >>
>> >> Using musl-based iconv utility (1.1.19):
>> >> $ xxd -r cp1255-snippet.xxd|$ICONV -f UTF-8 -t CP1255|$ICONV -f CP1255 -t UTF-8 | xxd
>> >> 00000000: 2ad6 b8d7 9d0a                           *.....
>> >>
>> >> Indeed, the result looks different than what was started with:
>> >>
>> >> *ָם
>> >>
>> >> (again apologies if that doesn't survive mailing and such)
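
Since the Hebrew text may not survive mailing, the two UTF-8 hexdumps above can also be decoded into explicit code points. This is a minimal Python sketch, assuming the dumps are UTF-8 as stated; the helper name `describe` is made up for illustration and is not part of any iconv tooling.

```python
import unicodedata

def describe(hexdump):
    """Decode a UTF-8 hexdump and label each code point with its Unicode name."""
    text = bytes.fromhex(hexdump).decode("utf-8")
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<control>')}" for ch in text]

# Bytes copied from the xxd output quoted above.
original = describe("efac b3d6 b8d7 9d0a")   # input, and the glibc round trip
musl_result = describe("2ad6 b8d7 9d0a")     # musl 1.1.19 round trip

for label, points in (("original", original), ("musl", musl_result)):
    print(label)
    for point in points:
        print("  " + point)
```

Only the first character differs: U+FB33 in the original, U+002A (the asterisk) after the musl round trip. The qamats, the final mem and the newline are unchanged.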
>> >>
>> >> This input was taken from gnu libiconv's test suite, in particular the
>> >> first line of tests/CP1255-snippet.UTF-8.  Since it's 2 characters,
>> >> and test data, I hope there's no problem re:licensing O:).
>> >>
>> >> I've reproduced the same behavior using iconv() directly, I can share
>> >> that if that would be preferable. It's the same code from earlier
>> >> iconv threads on the ML.
>> >
>> > No need; it's easy to reproduce, and I'm leaning towards saying the
>> > test is invalid. U+FB33 is a precomposed ligature form (from the
>> > Alphabetic Presentation Forms block), roughly equivalent in status to
>> > stuff like "ﬁ" (U+FB01). An iconv implementation could perform an
>> > approximate conversion for such characters, returning a positive value
>> > indicating the number of such substitutions made, but silently
>> > converting it in a lossy way is not conforming, and there's
>> > apparently no lossless way to convert it since CP1255 has no dedicated
>> > character slot for it (at least based on the definition of the
>> > codepage I'm using).
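
The point about U+FB33 having no slot of its own can be checked against another CP1255 table. The sketch below uses Python's built-in `cp1255` codec as a stand-in; that is an assumption, since it is neither musl's nor glibc's table. The helper `try_encode` is hypothetical.

```python
import unicodedata

ligature = "\uFB33"  # HEBREW LETTER DALET WITH DAGESH

# Canonical decomposition recorded in the Unicode Character Database.
decomposition = unicodedata.decomposition(ligature)

def try_encode(text):
    """Return the CP1255 bytes as hex, or None if a character has no mapping."""
    try:
        return text.encode("cp1255").hex()
    except UnicodeEncodeError:
        return None

direct = try_encode(ligature)
decomposed = try_encode(unicodedata.normalize("NFD", ligature))

print("decomposition:", decomposition)   # 05D3 05BC
print("direct encode:", direct)          # None: no dedicated slot in this table
print("after NFD:", decomposed)          # e3cc
```

With this table the ligature itself is unmappable, while its decomposed form (dalet followed by dagesh) encodes to two bytes. Whether a converter may apply that decomposition on its own is the question discussed above; the sketch only shows that the mapping exists.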
>>
>> Thanks for looking into this and for the great information!
>> I'll investigate more tomorrow, but wanted to respond to your inquiry
>> since it's easy to produce and might help explain things :).
>
> Any further findings?
>
>> > Do you know how/why they expect it to round-trip? What does glibc do
>> > when converting it -- can you show the intermediate (CP1255) form as a
>> > hexdump?
>>
>> Sure!
>>
>> Here's the intermediates for libiconv first, then w/musl:
>>
>> $ cat libiconv-cp1255.xxd
>> 00000000: e3cc c8ed 0a                             .....
>> $ cat musl-iconv-cp1255.xxd
>> 00000000: 2ac8 ed0a                                *...
>
> This is a plausible/reasonable conversion GNU iconv is doing...
>
>> Here's what happens when each of these is fed through both:
>>
>> ---- using libiconv's intermediate:
>> $ xxd -r ./libiconv-cp1255.xxd|iconv -f CP1255 -t UTF-8
>> דָּם
>> $ xxd -r ./libiconv-cp1255.xxd|iconv -f CP1255 -t UTF-8|xxd
>> 00000000: efac b3d6 b8d7 9d0a                      ........
>> $ xxd -r ./libiconv-cp1255.xxd|$ICONV -f CP1255 -t UTF-8
>> דָּם
>> $ xxd -r ./libiconv-cp1255.xxd|$ICONV -f CP1255 -t UTF-8|xxd
>> 00000000: d793 d6bc d6b8 d79d 0a                   .........
>
> ...but the GNU iconv behavior here is completely unreasonable/wrong.
> The first character it outputs is a presentation form for a ligature.
> There is no reason iconv should be doing this kind of renormalization
> when the original representation as two separate characters is
> available in the dest charset.
>
> Rich
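
As a rough cross-check of the decoding direction, the libiconv intermediate can be decoded with an independent CP1255 table. This sketch again assumes Python's `cp1255` codec and `unicodedata` module as stand-ins; it says nothing about how either libc implements the conversion.

```python
import unicodedata

# CP1255 intermediate produced by libiconv, copied from the hexdump above.
intermediate = bytes.fromhex("e3cc c8ed 0a")

decoded = intermediate.decode("cp1255")
code_points = [f"U+{ord(ch):04X}" for ch in decoded]
utf8_hex = decoded.encode("utf-8").hex()

# Check whether any standard Unicode normalization form would reintroduce
# the presentation form U+FB33 from the separate dalet and dagesh.
forms = ("NFC", "NFD", "NFKC", "NFKD")
ligature_reintroduced = any(
    "\uFB33" in unicodedata.normalize(form, decoded) for form in forms
)

print(code_points)            # ['U+05D3', 'U+05BC', 'U+05B8', 'U+05DD', 'U+000A']
print(utf8_hex)               # d793d6bcd6b8d79d0a
print(ligature_reintroduced)  # False
```

The bytes match the musl output quoted above (d793 d6bc d6b8 d79d 0a), and none of the four normalization forms composes dalet plus dagesh back into U+FB33, because that character is excluded from canonical composition.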

