From: Will Dietz
Newsgroups: gmane.linux.lib.musl.general
Subject: Re: iconv UTF-8 <--> CP1255 roundtrip possible bug?
Date: Thu, 14 Jun 2018 14:37:48 -0500
References: <20180516230425.GZ1392@brightrain.aerifal.cx> <20180603022635.GV1392@brightrain.aerifal.cx>
In-Reply-To: <20180603022635.GV1392@brightrain.aerifal.cx>
Reply-To: musl@lists.openwall.com
To: musl@lists.openwall.com

Nothing yet, I've not been able to spend more time on this lately sorry :).

I'll let you know if I do find anything, and at least I'll be trying
the latest changes you've pushed.
(yay all issues I've run into appear fixed :D)

Thanks!

~Will

On Sat, Jun 2, 2018 at 9:26 PM, Rich Felker wrote:
> On Wed, May 16, 2018 at 08:48:08PM -0500, Will Dietz wrote:
>> On Wed, May 16, 2018 at 6:04 PM, Rich Felker wrote:
>> > On Wed, May 16, 2018 at 12:22:36PM -0500, Will Dietz wrote:
>> >> I admit to being a bit unsure, but the behavior shown below doesn't
>> >> seem obviously right --LMK if I'm missing something :).
>> >>
>> >> Input file attached for inspection without relying on it getting
>> >> through byte-identical to what I have--
>> >> indeed I'm not sure copy+paste into this is working correctly (the
>> >> characters look different in my terminal :)). Anyway:
>> >>
>> >> $ cat cp1255-snippet.xxd
>> >> 00000000: efac b3d6 b8d7 9d0a  ........
>> >> $ xxd -r cp1255-snippet.xxd
>> >> דָּם
>> >>
>> >> Attempt to round-trip this from UTF-8 to CP1255 and back,
>> >> first with glibc's iconv (2.26):
>> >>
>> >> $ xxd -r cp1255-snippet.xxd | iconv -f UTF-8 -t CP1255 | iconv -f CP1255 -t UTF-8 | xxd
>> >> 00000000: efac b3d6 b8d7 9d0a
>> >>
>> >> Looks good, same as what was sent in.
>> >>
>> >> Using musl-based iconv utility (1.1.19):
>> >> $ xxd -r cp1255-snippet.xxd | $ICONV -f UTF-8 -t CP1255 | $ICONV -f CP1255 -t UTF-8 | xxd
>> >> 00000000: 2ad6 b8d7 9d0a  *.....
>> >>
>> >> Indeed, the result looks different than what was started with:
>> >>
>> >> *ָם
>> >>
>> >> (again apologies if that doesn't survive mailing and such)
>> >>
>> >> This input was taken from gnu libiconv's test suite, in particular the
>> >> first line of tests/CP1255-snippet.UTF-8. Since it's 2 characters,
>> >> and test data, I hope there's no problem re:licensing O:).
>> >>
>> >> I've reproduced the same behavior using iconv() directly, I can share
>> >> that if that would be preferable. It's the same code from earlier
>> >> iconv threads on the ML.
>> >
>> > No need; it's easy to reproduce, and I'm leaning towards saying the
>> > test is invalid. U+FB33 is a precomposed ligature form (from the
>> > Alphabetic Presentation Forms block), roughly equivalent in status to
>> > stuff like "ﬁ" (U+FB01). An iconv implementation could perform an
>> > approximate conversion for such characters, returning a positive value
>> > indicating the number of such substitutions made, but silently
>> > converting it in a lossy way is not conforming, and there's
>> > apparently no lossless way to convert it since CP1255 has no dedicated
>> > character slot for it (at least based on the definition of the
>> > codepage I'm using).
>>
>> Thanks for looking into this and for the great information!
>> I'll investigate more tomorrow, but wanted to respond to your inquiry
>> since it's easy to produce and might help explain things :).
>
> Any further findings?
>
>> > Do you know how/why they expect it to round-trip? What does glibc do
>> > when converting it -- can you show the intermediate (CP1255) form as a
>> > hexdump?
>>
>> Sure!
>>
>> Here's the intermediates for libiconv first, then w/musl:
>>
>> $ cat libiconv-cp1255.xxd
>> 00000000: e3cc c8ed 0a  .....
>> $ cat musl-iconv-cp1255.xxd
>> 00000000: 2ac8 ed0a  *...
>
> This is a plausible/reasonable conversion GNU iconv is doing...
>
>> Here's what happens when each of these are fed through both:
>>
>> ---- using libiconv's intermediate:
>> $ xxd -r ./libiconv-cp1255.xxd|iconv -f CP1255 -t UTF-8
>> דָּם
>> $ xxd -r ./libiconv-cp1255.xxd|iconv -f CP1255 -t UTF-8|xxd
>> 00000000: efac b3d6 b8d7 9d0a  ........
>> $ xxd -r ./libiconv-cp1255.xxd|$ICONV -f CP1255 -t UTF-8
>> דָּם
>> $ xxd -r ./libiconv-cp1255.xxd|$ICONV -f CP1255 -t UTF-8|xxd
>> 00000000: d793 d6bc d6b8 d79d 0a  .........
>
> ...but the GNU iconv behavior here is completely unreasonable/wrong.
> The first character it outputs is a presentation form for a ligature.
> There is no reason iconv should be doing this kind of renormalization
> when the original representation as two separate characters is
> available in the dest charset.
>
> Rich
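
The byte sequences quoted in this thread can be inspected without either iconv implementation. What follows is a minimal Python sketch, not part of the original messages. It assumes Python's built-in `cp1255` codec (whose table may differ from the ones musl and glibc use) and the standard `unicodedata` module; the variable names and the `codepoints` helper are made up for illustration. It decodes the hexdumps above and compares the original input with what musl produced from the libiconv intermediate, both directly and after Unicode normalization.

```python
import unicodedata

# Byte sequences copied from the hexdumps in the thread (trailing 0a dropped).
original_utf8 = bytes.fromhex("efacb3d6b8d79d")         # test input, UTF-8
libiconv_cp1255 = bytes.fromhex("e3ccc8ed")             # GNU intermediate, CP1255
musl_from_libiconv = bytes.fromhex("d793d6bcd6b8d79d")  # musl decoding of that intermediate


def codepoints(text):
    """Return the code points of text as a list of 'U+XXXX' strings."""
    return [f"U+{ord(ch):04X}" for ch in text]


original = original_utf8.decode("utf-8")
# Assumption: Python's cp1255 codec stands in for the iconv tables here.
intermediate = libiconv_cp1255.decode("cp1255")
musl_result = musl_from_libiconv.decode("utf-8")

print("original:    ", codepoints(original))
print("intermediate:", codepoints(intermediate))
print("musl result: ", codepoints(musl_result))

# U+FB33 has a canonical decomposition to U+05D3 U+05BC, so the two strings
# can differ code point by code point yet compare equal once normalized.
same_after_nfd = (
    unicodedata.normalize("NFD", original)
    == unicodedata.normalize("NFD", musl_result)
)
print("identical code points:", original == musl_result)
print("equal after NFD:      ", same_after_nfd)
```

Read this way, the CP1255 intermediate holds the dalet and the dagesh as two separate characters, which is the representation Rich argues should come back out unchanged; the presentation form only reappears if the decoder chooses to recompose it.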