From: Rich Felker <dalias@libc.org>
To: musl@lists.openwall.com
Subject: Re: iconv failure (ISO-2022-JP) since musl update on AlpineLinux
Date: Tue, 27 Feb 2018 12:34:26 -0500
Message-ID: <20180227173426.GQ1436@brightrain.aerifal.cx>
In-Reply-To: <20180227165704.vyc0m%steffen@sdaoden.eu>

On Tue, Feb 27, 2018 at 05:57:04PM +0100, Steffen Nurpmeso wrote:
> Hello.
>
> After updating to musl-1.1.19-r0 there i saw test failures for the
> MUA i maintain, namely regarding the mentioned charset. I will
> attach a file to reproduce. (Am not subscribed.)
> Ciao!
>
> #?0[steffen@devon steffen]$ cksum in.utf
> 1259742080 686 in.utf
> #?0[steffen@devon steffen]$ iconv -f utf8 -t iso-2022-jp <in.utf|cksum
> 2184132317 536
> #?0[steffen@devon steffen]$ iconv --version
> iconv (GNU libiconv 1.11)
> ...
> #?0[steffen@essex tmp]$ cksum in.utf
> 1259742080 686 in.utf
> #?0[steffen@essex tmp]$ iconv -f utf8 -t iso-2022-jp <in.utf|cksum
> 209789743 1736
> #?0[steffen@essex tmp]$ apk info --who-owns /usr/bin/iconv
> /usr/bin/iconv is owned by musl-utils-1.1.19-r0

Does the data round-trip correctly? I don't think you can expect
bitwise match between outputs of different ISO-2022-JP converters,
unless perhaps they both guarantee minimality, because the ISO-2022-JP
representation of a string is highly nonunique.
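
As a rough illustration of the non-uniqueness, here is a minimal sketch that uses Python's built-in iso2022_jp codec as a stand-in decoder. It is not either of the iconv implementations compared above, and the two byte strings are hand-written for the example. Both are valid ISO-2022-JP and decode to the same text, yet they differ bitwise, so a checksum of the encoded output cannot tell a broken converter from a merely different one.

```python
# Two hand-written ISO-2022-JP encodings of the same two-character string.
# ESC $ B selects JIS X 0208; ESC ( B returns to ASCII.
text = "\u3042\u3044"  # HIRAGANA LETTER A, HIRAGANA LETTER I

# One escape pair around the whole run of Japanese characters.
minimal = b'\x1b$B$"$$\x1b(B'
# One escape pair around every single character.
per_char = b'\x1b$B$"\x1b(B' + b'\x1b$B$$\x1b(B'


def round_trips(encoded: bytes, original: str) -> bool:
    """Return True if decoding the bytes yields the original text."""
    return encoded.decode("iso2022_jp") == original


print("bitwise equal:", minimal == per_char)
print("both round-trip:", round_trips(minimal, text), round_trips(per_char, text))
print("sizes:", len(minimal), len(per_char))
```

A more meaningful test than comparing cksum output is therefore to convert the result back to UTF-8 and compare that against the original input.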
In particular musl's to-ISO-2022-JP converter is stateless and always
generates shifts in/out around every non-ASCII character. Of course
this is highly suboptimal, but in the worst case (where the caller
calls iconv one character at a time) the iconv API can't do any better
because strings are required to end in the unshifted state, and the
iconv API doesn't have any method to "finalize" a conversion. This
implies that every time iconv returns with non-ASCII as the most
recent output character, it must be followed by a shift back to the
initial (ASCII) state.
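
The stateless behaviour described above can be sketched roughly as follows. This is a toy model, not musl's actual code: the function names are made up for the example, only ASCII and JIS X 0208 are handled, and the JIS X 0208 bytes are obtained by clearing the high bit of Python's euc_jp output (which holds for two-byte EUC-JP sequences). Because every call ends in the ASCII state, the output is the same however the caller splits the input.

```python
ESC_JISX0208 = b"\x1b$B"  # switch to JIS X 0208
ESC_ASCII = b"\x1b(B"     # switch back to ASCII (the initial state)


def jis_bytes(ch: str) -> bytes:
    """Hypothetical helper: JIS X 0208 bytes for one character.

    Two-byte EUC-JP is the JIS X 0208 code with the high bit set on each
    byte, so clearing that bit recovers the 7-bit form.
    """
    euc = ch.encode("euc_jp")
    if len(euc) != 2:
        raise ValueError("only JIS X 0208 characters are handled in this sketch")
    return bytes(b & 0x7F for b in euc)


def encode_stateless(text: str) -> bytes:
    """Wrap every non-ASCII character in its own escape pair."""
    out = bytearray()
    for ch in text:
        if ord(ch) < 0x80:
            out += ch.encode("ascii")
        else:
            out += ESC_JISX0208 + jis_bytes(ch) + ESC_ASCII
    return bytes(out)


sample = "a\u3042\u3044"  # 'a', HIRAGANA LETTER A, HIRAGANA LETTER I
print(encode_stateless(sample))
# Feeding the input one character per call gives identical output.
print(encode_stateless(sample) == b"".join(encode_stateless(c) for c in sample))
```

In this model each non-ASCII character costs eight output bytes instead of two, which is the kind of size overhead being discussed.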
We could improve this in the case of batch conversions by overwriting
the previous shift-back-to-initial and skipping the next shift if the
character set of the next character to output matches the previous
one, but that only works within a single batch call, since iconv can't
write outside the buffer passed to it for the current call. This is an
improvement I think I want to make, since it would improve typical
output size a lot, but the cost is output determinism under different
chunking by the caller.
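
A minimal sketch of that trade-off, under the same toy assumptions (hypothetical function names, ASCII and JIS X 0208 only, JIS bytes derived from Python's euc_jp codec, and not a proposed musl patch): each simulated call merges runs of same-charset characters but still returns to ASCII before it ends. Tracking the current state and delaying the shift back has the same effect on the output as overwriting the previous shift-back in the buffer.

```python
ESC_JISX0208 = b"\x1b$B"  # switch to JIS X 0208
ESC_ASCII = b"\x1b(B"     # switch back to ASCII (the initial state)


def jis_bytes(ch: str) -> bytes:
    """Hypothetical helper: JIS X 0208 bytes via two-byte EUC-JP."""
    euc = ch.encode("euc_jp")
    if len(euc) != 2:
        raise ValueError("only JIS X 0208 characters are handled in this sketch")
    return bytes(b & 0x7F for b in euc)


def encode_call(text: str) -> bytes:
    """Model one conversion call: merge runs, but always end in ASCII."""
    out = bytearray()
    in_jis = False
    for ch in text:
        if ord(ch) < 0x80:
            if in_jis:
                out += ESC_ASCII
                in_jis = False
            out += ch.encode("ascii")
        else:
            if not in_jis:
                out += ESC_JISX0208
                in_jis = True
            out += jis_bytes(ch)
    if in_jis:
        out += ESC_ASCII  # the call may not leave the output shifted
    return bytes(out)


text = "\u3042\u3044"  # HIRAGANA LETTER A, HIRAGANA LETTER I
whole = encode_call(text)                               # one batch call
chunked = b"".join(encode_call(ch) for ch in text)      # one call per character
print(len(whole), len(chunked), whole == chunked)
```

Both results decode to the same text, but the bytes now depend on where the caller happened to split the input, which is the loss of determinism mentioned above.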
Rich