iconv UTF-8 <--> CP1255 roundtrip possible bug?

mailing list of musl libc

* iconv UTF-8 <--> CP1255 roundtrip possible bug?
From: Will Dietz @ 2018-05-16 17:22 UTC
  To: musl

I admit to being a bit unsure, but the behavior shown below doesn't
seem obviously right --LMK if I'm missing something :).

Input file attached for inspection without relying on it getting
through byte-identical to what I have--
indeed I'm not sure copy+paste into this is working correctly (the
characters look different in my terminal :)).  Anyway:

$ cat cp1255-snippet.xxd
00000000: efac b3d6 b8d7 9d0a                      ........
$ xxd -r cp1255-snippet.xxd
דָּם
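
For readers whose mail client mangles the Hebrew text, the dump can also be decoded by hand. The sketch below uses Python purely as a neutral byte-level calculator (it is not part of the original report and does not involve either iconv implementation) to list the code points encoded by those eight bytes.

```python
import unicodedata

# The eight bytes from cp1255-snippet.xxd above.
raw = bytes.fromhex("efacb3d6b8d79d0a")

# Decode as UTF-8 and list each code point with its Unicode name.
text = raw.decode("utf-8")
code_points = [f"U+{ord(ch):04X}" for ch in text]
names = [unicodedata.name(ch, "<control>") for ch in text]

for cp, name in zip(code_points, names):
    print(cp, name)
```

The first code point, U+FB33, is the one the rest of the thread turns on.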

Attempt to round-trip this from UTF-8 to CP1255 and back,
first with glibc's iconv (2.26):

$ xxd -r cp1255-snippet.xxd | iconv -f UTF-8 -t CP1255 | iconv -f CP1255 -t UTF-8 | xxd
00000000: efac b3d6 b8d7 9d0a

Looks good, same as what was sent in.

Using musl-based iconv utility (1.1.19):
$ xxd -r cp1255-snippet.xxd | $ICONV -f UTF-8 -t CP1255 | $ICONV -f CP1255 -t UTF-8 | xxd
00000000: 2ad6 b8d7 9d0a                           *.....

Indeed, the result looks different than what was started with:

*ָם

(again apologies if that doesn't survive mailing and such)

This input was taken from gnu libiconv's test suite, in particular the
first line of tests/CP1255-snippet.UTF-8.  Since it's 2 characters,
and test data, I hope there's no problem re:licensing O:).

I've reproduced the same behavior using iconv() directly, I can share
that if that would be preferable. It's the same code from earlier
iconv threads on the ML.

--------------

Hopefully this is useful!

On the subject, a question or two if it's not too much trouble:

* Is the above what's meant by "round-trip" as discussed in [1]?
* What sorts of "round-trip" conversions are expected to work? And
over what inputs should round-trip conversions work -- for any "valid"
UTF-8 or so?

Armed with some insights regarding these questions, I'm hoping to
scope out something that can be tested or (no promises!) perhaps
pushed through some formal verification goodness.  But also I'm just
curious :).

Thanks!

~Will

[1] http://www.openwall.com/lists/musl/2018/02/27/2


* Re: iconv UTF-8 <--> CP1255 roundtrip possible bug?
From: Rich Felker @ 2018-05-16 23:04 UTC
  To: musl

On Wed, May 16, 2018 at 12:22:36PM -0500, Will Dietz wrote:
> I've reproduced the same behavior using iconv() directly, I can share
> that if that would be preferable. It's the same code from earlier
> iconv threads on the ML.

No need; it's easy to reproduce, and I'm leaning towards saying the
test is invalid. U+FB33 is a precomposed ligature form (from the
Alphabetic Presentation Forms block), roughly equivalent in status to
stuff like "ﬁ" (U+FB01). An iconv implementation could perform an
approximate conversion for such characters, returning a positive value
indicating the number of such substitutions made, but silently
converting it in a lossy way is not conforming, and there's
apparently no lossless way to convert it since CP1255 has no dedicated
character slot for it (at least based on the definition of the
codepage I'm using).
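
A small illustration of the point about U+FB33, using Python's unicodedata module and built-in cp1255 codec as stand-ins. Neither is musl's or GNU's conversion table, so treat the codec's behavior as an assumption about CP1255 rather than a statement about any iconv implementation.

```python
import unicodedata

ch = "\ufb33"  # HEBREW LETTER DALET WITH DAGESH

# Unicode records a decomposition into DALET (U+05D3) + DAGESH (U+05BC).
decomposed = unicodedata.normalize("NFD", ch)
print([f"U+{ord(c):04X}" for c in decomposed])

# A strict table lookup has no CP1255 byte for the precomposed form...
try:
    ch.encode("cp1255")
    has_slot = True
except UnicodeEncodeError:
    has_slot = False
print("precomposed form has a CP1255 slot:", has_slot)

# ...while both parts of the decomposition do have one.
print(decomposed.encode("cp1255").hex())
```

Reaching the bytes e3 cc therefore requires a decomposition step before the table lookup.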

Do you know how/why they expect it to round-trip? What does glibc do
when converting it -- can you show the intermediate (CP1255) form as a
hexdump?

> --------------
> 
> Hopefully this is useful!
> 
> On the subject, a question or two if it's not too much trouble:
> 
> * Is the above what's meant by "round-trip" as discussed in [1]?
> * What sorts of "round-trip" conversions are expected to work? And
> over what inputs should round-trip conversions work -- for any "valid"
> UTF-8 or so?

Any UTF-8 whose content is representable in the encoding you're asking
about round-tripping through, i.e. where the first iconv returns 0.
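
That criterion can be sketched as a property check. The helper below is hypothetical (the name round_trips is not from musl or any test suite), and it uses Python's strict codecs as a stand-in for "the first iconv returns 0": an encode that raises is treated as not representable, and the input is skipped rather than counted as a failure.

```python
from typing import Optional

def round_trips(text: str, encoding: str) -> Optional[bool]:
    """Return None if text is not representable in encoding,
    otherwise whether encode-then-decode reproduces it exactly."""
    try:
        encoded = text.encode(encoding)  # strict: no substitutions
    except UnicodeEncodeError:
        return None  # outside the property's domain
    return encoded.decode(encoding) == text

# Decomposed spelling: every code point has a CP1255 slot.
print(round_trips("\u05d3\u05bc\u05b8\u05dd\n", "cp1255"))
# Precomposed U+FB33: not representable under a strict table lookup.
print(round_trips("\ufb33\u05b8\u05dd\n", "cp1255"))
```

A harness for musl would call iconv() itself and check its return value instead of relying on Python's codecs.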

> Armed with some insights regarding these questions, I'm hoping to
> scope out something that can be tested or (no promises!) perhaps
> pushed through some formal verification goodness.  But also I'm just
> curious :).

Yay!

Rich



* Re: iconv UTF-8 <--> CP1255 roundtrip possible bug?
From: Will Dietz @ 2018-05-17  1:48 UTC
  To: musl

On Wed, May 16, 2018 at 6:04 PM, Rich Felker <dalias@libc.org> wrote:
> No need; it's easy to reproduce, and I'm leaning towards saying the
> test is invalid. U+FB33 is a precomposed ligature form (from the
> Alphabetic Presentation Forms block), roughly equivalent in status to
> stuff like "ﬁ" (U+FB01). An iconv implementation could perform an
> approximate conversion for such characters, returning a positive value
> indicating the number of such substitutions made, but silently
> converting it in a lossy way is not conforming, and there's
> apparently no lossless way to convert it since CP1255 has no dedicated
> character slot for it (at least based on the definition of the
> codepage I'm using).

Thanks for looking into this and for the great information!
I'll investigate more tomorrow, but wanted to respond to your inquiry
since it's easy to produce and might help explain things :).

>
> Do you know how/why they expect it to round-trip? What does glibc do
> when converting it -- can you show the intermediate (CP1255) form as a
> hexdump?

Sure!

Here's the intermediates for libiconv first, then w/musl:

$ cat libiconv-cp1255.xxd
00000000: e3cc c8ed 0a                             .....
$ cat musl-iconv-cp1255.xxd
00000000: 2ac8 ed0a                                *...

Here's what happens when each of these is fed through both:

---- using libiconv's intermediate:
$ xxd -r ./libiconv-cp1255.xxd|iconv -f CP1255 -t UTF-8
דָּם
$ xxd -r ./libiconv-cp1255.xxd|iconv -f CP1255 -t UTF-8|xxd
00000000: efac b3d6 b8d7 9d0a                      ........
$ xxd -r ./libiconv-cp1255.xxd|$ICONV -f CP1255 -t UTF-8
דָּם
$ xxd -r ./libiconv-cp1255.xxd|$ICONV -f CP1255 -t UTF-8|xxd
00000000: d793 d6bc d6b8 d79d 0a                   .........


---- using musl's intermediate:
$ xxd -r ./musl-iconv-cp1255.xxd|iconv -f CP1255 -t UTF-8
*ָם
$ xxd -r ./musl-iconv-cp1255.xxd|iconv -f CP1255 -t UTF-8|xxd
00000000: 2ad6 b8d7 9d0a                           *.....
$ xxd -r ./musl-iconv-cp1255.xxd|$ICONV -f CP1255 -t UTF-8
*ָם
$ xxd -r ./musl-iconv-cp1255.xxd|$ICONV -f CP1255 -t UTF-8|xxd
00000000: 2ad6 b8d7 9d0a

For whatever that's worth! :)
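
The outputs above can be cross-checked at the byte level. The sketch below uses Python's built-in cp1255 codec as a neutral reference (an assumption: it is a plain table lookup with no composition step, and it is neither of the two iconv implementations) to decode both intermediates and compare against the hexdumps.

```python
gnu_mid = bytes.fromhex("e3ccc8ed0a")   # libiconv-cp1255.xxd
musl_mid = bytes.fromhex("2ac8ed0a")    # musl-iconv-cp1255.xxd

def code_points(data: bytes) -> list:
    """Decode CP1255 bytes one-to-one and list the code points."""
    return [f"U+{ord(c):04X}" for c in data.decode("cp1255")]

print(code_points(gnu_mid))
print(code_points(musl_mid))

# A one-to-one decode of the first intermediate, re-encoded as UTF-8,
# matches the bytes $ICONV printed for it.
table_utf8 = gnu_mid.decode("cp1255").encode("utf-8").hex()
print(table_utf8)

# The other output starts with U+FB33 instead of U+05D3 U+05BC.
composed_utf8 = bytes.fromhex("efacb3d6b8d79d0a").decode("utf-8")
print(composed_utf8.startswith("\ufb33"))
```

So the two decoders differ only in whether DALET followed by DAGESH is combined into the presentation form on the way back to UTF-8.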

>
>> --------------
>>
>> Hopefully this is useful!
>>
>> On the subject, a question or two if it's not too much trouble:
>>
>> * Is the above what's meant by "round-trip" as discussed in [1]?
>> * What sorts of "round-trip" conversions are expected to work? And
>> over what inputs should round-trip conversions work -- for any "valid"
>> UTF-8 or so?
>
> Any UTF-8 whose content is representable in the encoding you're asking
> about round-tripping through, i.e. where the first iconv returns 0.

Ah, okay.  I'll think about this and look into it a bit more.  Thanks
for answering my questions!

~Will




* Re: iconv UTF-8 <--> CP1255 roundtrip possible bug?
From: Rich Felker @ 2018-06-03  2:26 UTC
  To: musl

On Wed, May 16, 2018 at 08:48:08PM -0500, Will Dietz wrote:
> Thanks for looking into this and for the great information!
> I'll investigate more tomorrow, but wanted to respond to your inquiry
> since it's easy to produce and might help explain things :).

Any further findings?

> > Do you know how/why they expect it to round-trip? What does glibc do
> > when converting it -- can you show the intermediate (CP1255) form as a
> > hexdump?
> 
> Sure!
> 
> Here's the intermediates for libiconv first, then w/musl:
> 
> $ cat libiconv-cp1255.xxd
> 00000000: e3cc c8ed 0a                             .....
> $ cat musl-iconv-cp1255.xxd
> 00000000: 2ac8 ed0a                                *...

This is a plausible/reasonable conversion GNU iconv is doing...

> Here's what happens when each of these is fed through both:
> 
> ---- using libiconv's intermediate:
> $ xxd -r ./libiconv-cp1255.xxd|iconv -f CP1255 -t UTF-8
> דָּם
> $ xxd -r ./libiconv-cp1255.xxd|iconv -f CP1255 -t UTF-8|xxd
> 00000000: efac b3d6 b8d7 9d0a                      ........
> $ xxd -r ./libiconv-cp1255.xxd|$ICONV -f CP1255 -t UTF-8
> דָּם
> $ xxd -r ./libiconv-cp1255.xxd|$ICONV -f CP1255 -t UTF-8|xxd
> 00000000: d793 d6bc d6b8 d79d 0a                   .........

...but the GNU iconv behavior here is completely unreasonable/wrong.
The first character it outputs is a presentation form for a ligature.
There is no reason iconv should be doing this kind of renormalization
when the original representation as two separate characters is
available in the dest charset.

Rich



* Re: iconv UTF-8 <--> CP1255 roundtrip possible bug?
From: Will Dietz @ 2018-06-14 19:37 UTC
  To: musl

Nothing yet; I've not been able to spend more time on this lately, sorry :).

I'll let you know if I do find anything,
and at least I'll be trying the latest changes you've pushed.

(yay all issues I've run into appear fixed :D)

Thanks!

~Will


