From: Will Dietz
Newsgroups: gmane.linux.lib.musl.general
Subject: iconv UTF-8 <--> CP1255 roundtrip possible bug?
Date: Wed, 16 May 2018 12:22:36 -0500
To: musl@lists.openwall.com

I admit to being a bit unsure, but the behavior shown below doesn't seem obviously right. Let me know if I'm missing something :).

The input file is attached for inspection, so that nothing relies on it getting through byte-identical to what I have. Indeed, I'm not sure copy+paste into this message is working correctly (the characters look different in my terminal :)).

Anyway:

$ cat cp1255-snippet.xxd
00000000: efac b3d6 b8d7 9d0a ........
$ xxd -r cp1255-snippet.xxd
=EF=AC=B3=D6=B8=D7=9D

Attempt to round-trip this from UTF-8 to CP1255 and back, first with glibc's iconv (2.26):

$ xxd -r cp1255-snippet.xxd|iconv -f UTF-8 -t CP1255|iconv -f CP1255 -t UTF-8 | xxd
00000000: efac b3d6 b8d7 9d0a

Looks good, same as what was sent in.

Using the musl-based iconv utility (1.1.19):

$ xxd -r cp1255-snippet.xxd|$ICONV -f UTF-8 -t CP1255|$ICONV -f CP1255 -t UTF-8 | xxd
00000000: 2ad6 b8d7 9d0a *.....

Indeed, the result looks different from what I started with:

*=D6=B8=D7=9D

(again, apologies if that doesn't survive mailing and such)

This input was taken from GNU libiconv's test suite, in particular the first line of tests/CP1255-snippet.UTF-8.
Since it's 2 characters, and test data, I hope there's no problem regarding licensing O:).

I've reproduced the same behavior using iconv() directly; I can share that if that would be preferable. It's the same code from earlier iconv threads on the ML.

--------------

Hopefully this is useful! On the subject, a question or two if it's not too much trouble:

* Is the above what's meant by "round-trip" as discussed in [1]?
* What sorts of "round-trip" conversions are expected to work? And over what inputs should round-trip conversions work: for any "valid" UTF-8 or so?

Armed with some insights regarding these questions, I'm hoping to scope out something that can be tested or (no promises!) perhaps pushed through some formal verification goodness. But also I'm just curious :).

Thanks!

~Will

[1] http://www.openwall.com/lists/musl/2018/02/27/2
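
For reference, the UTF-8 -> CP1255 -> UTF-8 round trip from the shell pipelines above could be sketched with iconv() directly along the following lines. This is a minimal illustration and not the code from the earlier threads: the helper names convert() and roundtrips() and the 64-byte scratch buffers are assumptions made up for this sketch.

```c
#include <iconv.h>
#include <string.h>

/* Hypothetical helper: convert `inlen` bytes of `in` from encoding `from`
 * to encoding `to`, writing at most `outcap` bytes to `out`.
 * Returns the number of bytes written, or -1 on any conversion error. */
static long convert(const char *from, const char *to,
                    const char *in, size_t inlen,
                    char *out, size_t outcap)
{
    iconv_t cd = iconv_open(to, from);   /* note the order: (tocode, fromcode) */
    if (cd == (iconv_t)-1)
        return -1;

    char *inp = (char *)in;              /* iconv() wants char **, input is not modified */
    char *outp = out;
    size_t inleft = inlen, outleft = outcap;

    size_t r = iconv(cd, &inp, &inleft, &outp, &outleft);
    if (r != (size_t)-1)
        /* Flush anything the converter is still holding back. */
        r = iconv(cd, NULL, NULL, &outp, &outleft);

    iconv_close(cd);
    if (r == (size_t)-1)
        return -1;
    return (long)(outcap - outleft);
}

/* Hypothetical helper: returns 1 if `utf8` comes back byte-identical after
 * UTF-8 -> CP1255 -> UTF-8, 0 if it comes back different, and -1 if either
 * conversion fails (including input too large for the scratch buffers). */
static int roundtrips(const char *utf8, size_t len)
{
    char mid[64], back[64];

    long n = convert("UTF-8", "CP1255", utf8, len, mid, sizeof mid);
    if (n < 0)
        return -1;

    long m = convert("CP1255", "UTF-8", mid, (size_t)n, back, sizeof back);
    if (m < 0)
        return -1;

    return (size_t)m == len && memcmp(back, utf8, len) == 0;
}
```

Calling roundtrips() on the 8 bytes from the hexdump (ef ac b3 d6 b8 d7 9d 0a) would be the direct equivalent of the pipelines above; the result depends on which libc's iconv the program is linked against.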