From mboxrd@z Thu Jan 1 00:00:00 1970
From: Will Dietz
Newsgroups: gmane.linux.lib.musl.general
Subject: Re: iconv UTF-8 <--> CP1255 roundtrip possible bug?
Date: Wed, 16 May 2018 20:48:08 -0500
References: <20180516230425.GZ1392@brightrain.aerifal.cx>
In-Reply-To: <20180516230425.GZ1392@brightrain.aerifal.cx>
Reply-To: musl@lists.openwall.com
To: musl@lists.openwall.com
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

On Wed, May 16, 2018 at 6:04 PM, Rich Felker wrote:
> On Wed, May 16, 2018 at 12:22:36PM -0500, Will Dietz wrote:
>> I admit to being a bit unsure, but the behavior shown below doesn't
>> seem obviously right --LMK if I'm missing something :).
>>
>> Input file attached for inspection without relying on it getting
>> through byte-identical to what I have--
>> indeed I'm not sure copy+paste into this is working correctly (the
>> characters look different in my terminal :)). Anyway:
>>
>> $ cat cp1255-snippet.xxd
>> 00000000: efac b3d6 b8d7 9d0a ........
>> $ xxd -r cp1255-snippet.xxd
>> =EF=AC=B3=D6=B8=D7=9D
>>
>> Attempt to round-trip this from UTF-8 to CP1255 and back,
>> first with glibc's iconv (2.26):
>>
>> $ xxd -r cp1255-snippet.xxd|iconv -f UTF-8 -t CP1255|iconv -f CP1255
>> -t UTF-8 | xxd
>> 00000000: efac b3d6 b8d7 9d0a
>>
>> Looks good, same as what was sent in.
>>
>> Using musl-based iconv utility (1.1.19):
>> $ xxd -r cp1255-snippet.xxd|$ICONV -f UTF-8 -t CP1255|$ICONV -f CP1255
>> -t UTF-8 | xxd
>> 00000000: 2ad6 b8d7 9d0a *.....
>>
>> Indeed, the result looks different than what was started with:
>>
>> *=D6=B8=D7=9D
>>
>> (again apologies if that doesn't survive mailing and such)
>>
>> This input was taken from gnu libiconv's test suite, in particular the
>> first line of tests/CP1255-snippet.UTF-8. Since it's 2 characters,
>> and test data, I hope there's no problem re:licensing O:).
>>
>> I've reproduced the same behavior using iconv() directly, I can share
>> that if that would be preferable. It's the same code from earlier
>> iconv threads on the ML.
>
> No need; it's easy to reproduce, and I'm leaning towards saying the
> test is invalid. U+FB33 is a precomposed ligature form (from the
> Alphabetic Presentation Forms block), roughly equivalent in status to
> stuff like "=EF=AC=81" (U+FB01). An iconv implementation could perform an
> approximate conversion for such characters, returning a positive value
> indicating the number of such substitutions made, but silently
> converting it in a lossy way is not conforming, and of there's
> apparently no lossless way to convert it since CP1255 has no dedicated
> character slot for it (at least based on the definition of the
> codepage I'm using).

Thanks for looking into this and for the great information!

I'll investigate more tomorrow, but wanted to respond to your inquiry
since it's easy to produce and might help explain things :).

>
> Do you know how/why they expect it to round-trip? What does glibc do
> when converting it -- can you show the intermediate (CP1255) form as a
> hexdump?

Sure!

Here's the intermediates for libiconv first, then w/musl:

$ cat libiconv-cp1255.xxd
00000000: e3cc c8ed 0a .....
$ cat musl-iconv-cp1255.xxd
00000000: 2ac8 ed0a *...
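
To make the two dumps easier to compare, they can be decoded byte by
byte. This is only a rough sketch: it uses Python 3's built-in "cp1255"
codec as a stand-in for the code page table (an assumption; it is
neither glibc's, libiconv's, nor musl's converter), and the helper name
code_points is made up for illustration.

```python
import unicodedata

# Bytes copied from the two hexdumps above.
libiconv_bytes = bytes.fromhex("e3cc c8ed 0a")
musl_bytes = bytes.fromhex("2ac8 ed0a")


def code_points(text):
    """Return the code points of text in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


# Assumption: Python's cp1255 table agrees with the tables used above.
libiconv_text = libiconv_bytes.decode("cp1255")
musl_text = musl_bytes.decode("cp1255")

print(code_points(libiconv_text))
print(code_points(musl_text))

# U+FB33 canonically decomposes to DALET (U+05D3) + DAGESH (U+05BC),
# which is the pair that the bytes e3 cc stand for in this table.
decomposed = unicodedata.normalize("NFD", "\ufb33")
print(code_points(decomposed))
```

Read this way, the libiconv intermediate starts with the decomposed pair
for U+FB33, while the musl intermediate starts with a literal '*'
(0x2a).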
Here's what happens when each of these is fed through both:

---- using libiconv's intermediate:

$ xxd -r ./libiconv-cp1255.xxd|iconv -f CP1255 -t UTF-8
=EF=AC=B3=D6=B8=D7=9D
$ xxd -r ./libiconv-cp1255.xxd|iconv -f CP1255 -t UTF-8|xxd
00000000: efac b3d6 b8d7 9d0a ........
$ xxd -r ./libiconv-cp1255.xxd|$ICONV -f CP1255 -t UTF-8
=D7=93=D6=BC=D6=B8=D7=9D
$ xxd -r ./libiconv-cp1255.xxd|$ICONV -f CP1255 -t UTF-8|xxd
00000000: d793 d6bc d6b8 d79d 0a .........

--- using musl's intermediate *.....

$ xxd -r ./musl-iconv-cp1255.xxd|iconv -f CP1255 -t UTF-8
*=D6=B8=D7=9D
$ xxd -r ./musl-iconv-cp1255.xxd|iconv -f CP1255 -t UTF-8|xxd
00000000: 2ad6 b8d7 9d0a *.....
$ xxd -r ./musl-iconv-cp1255.xxd|$ICONV -f CP1255 -t UTF-8
*=D6=B8=D7=9D
$ xxd -r ./musl-iconv-cp1255.xxd|$ICONV -f CP1255 -t UTF-8|xxd
00000000: 2ad6 b8d7 9d0a

For whatever that's worth! :)

>
>> --------------
>>
>> Hopefully this is useful!
>>
>> On the subject, a question or two if it's not too much trouble:
>>
>> * is the above what's meant by "round-trip" as discussed in[1]?
>> * What sorts of "round-trip" conversions are expected to work? And
>> over what inputs should round-trip conversions work-- for any 'valid"
>> UTF-8 or so?
>
> Any UTF-8 whose content is representable in the encoding you're asking
> about round-tripping through, i.e. where the first iconv returns 0.

Ah, okay. I'll think about this and look into it a bit more.

Thanks for answering my questions!

~Will

>
>> Armed with some insights regarding these questions, I'm hoping to
>> scope out something that can be tested or (no promises!) perhaps
>> pushed through some formal verification goodness. But also I'm just
>> curious :).
>
> Yay!
>
> Rich
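
The criterion quoted above (a round-trip is expected only when the first
conversion is lossless) can be sketched as a small predicate. This is a
hedged illustration, not iconv itself: Python's strict "cp1255" codec
stands in for "the first iconv returns 0", and the function names
encodes_losslessly and round_trips are made up for this example.

```python
def encodes_losslessly(text, encoding):
    """True if every character of text has a slot in encoding.

    Stand-in (assumption) for the first iconv call returning 0.
    """
    try:
        text.encode(encoding)  # strict mode: raises on unmappable input
    except UnicodeEncodeError:
        return False
    return True


def round_trips(text, encoding):
    """True if text is representable and survives encode then decode."""
    if not encodes_losslessly(text, encoding):
        return False  # outside the set of inputs expected to round-trip
    return text.encode(encoding).decode(encoding) == text


# The test input from the thread: U+FB33 U+05B8 U+05DD, newline.
precomposed = bytes.fromhex("efacb3d6b8d79d0a").decode("utf-8")
# The same text with U+FB33 spelled as U+05D3 U+05BC.
decomposed = "\u05d3\u05bc\u05b8\u05dd\n"

print(round_trips(precomposed, "cp1255"))
print(round_trips(decomposed, "cp1255"))
```

Under this stand-in, the precomposed input falls outside the set of
inputs that are expected to round-trip, while the decomposed spelling is
inside it.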