Re: iconv failure (ISO-2022-JP) since musl update on AlpineLinux

From: Rich Felker <dalias@libc.org>
To: musl@lists.openwall.com
Subject: Re: iconv failure (ISO-2022-JP) since musl update on AlpineLinux
Date: Tue, 27 Feb 2018 12:34:26 -0500
Message-ID: <20180227173426.GQ1436@brightrain.aerifal.cx>
In-Reply-To: <20180227165704.vyc0m%steffen@sdaoden.eu>

On Tue, Feb 27, 2018 at 05:57:04PM +0100, Steffen Nurpmeso wrote:
> Hello.
> 
> After updating to musl-1.1.19-r0 there i saw test failures for the
> MUA i maintain, namely regarding the mentioned charset.  I will
> attach a file to reproduce.  (Am not subscribed.)
> Ciao!
> 
>   #?0[steffen@devon steffen]$ cksum in.utf 
>   1259742080 686 in.utf
>   #?0[steffen@devon steffen]$ iconv -f utf8 -t iso-2022-jp <in.utf|cksum
>   2184132317 536
>   #?0[steffen@devon steffen]$ iconv --version
>   iconv (GNU libiconv 1.11)
> ...
>   #?0[steffen@essex tmp]$ cksum in.utf 
>   1259742080 686 in.utf
>   #?0[steffen@essex tmp]$ iconv -f utf8 -t iso-2022-jp <in.utf|cksum 
>   209789743 1736
>   #?0[steffen@essex tmp]$ apk info --who-owns /usr/bin/iconv 
>   /usr/bin/iconv is owned by musl-utils-1.1.19-r0

Does the data round-trip correctly? I don't think you can expect
bitwise match between outputs of different ISO-2022-JP converters,
unless perhaps they both guarantee minimality, because the ISO-2022-JP
representation of a string is highly nonunique.
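
To illustrate the nonuniqueness point, here is a minimal sketch of a check that compares two ISO-2022-JP byte strings by decoded content rather than bitwise. It is an editorial illustration, not code from musl or libiconv: the names next_char and iso2022jp_same_text are hypothetical, and only two of the character sets (ASCII via ESC ( B and JIS X 0208 via ESC $ B) are handled.

```c
#include <stddef.h>

/* Character sets handled by this sketch. Real ISO-2022-JP also has
 * JIS X 0201 Roman (ESC ( J) and JIS C 6226-1978 (ESC $ @). */
enum charset { CS_ASCII, CS_JISX0208 };

struct cursor {
    const unsigned char *p, *end;
    enum charset cs;            /* currently selected character set */
};

/* Fetch the next character, consuming any escape sequences before it.
 * Returns 1 and fills *cs and *code on success, 0 at end of input,
 * and -1 on malformed or unsupported input. */
static int next_char(struct cursor *c, enum charset *cs, unsigned *code)
{
    while (c->end - c->p >= 3 && c->p[0] == 0x1b) {
        if (c->p[1] == '(' && c->p[2] == 'B') c->cs = CS_ASCII;
        else if (c->p[1] == '$' && c->p[2] == 'B') c->cs = CS_JISX0208;
        else return -1;
        c->p += 3;
    }
    if (c->p == c->end) return 0;
    if (c->p[0] == 0x1b) return -1;     /* truncated escape sequence */
    *cs = c->cs;
    if (c->cs == CS_ASCII) {
        *code = *c->p++;
        return 1;
    }
    if (c->end - c->p < 2) return -1;   /* truncated two-byte character */
    *code = (unsigned)c->p[0] << 8 | c->p[1];
    c->p += 2;
    return 1;
}

/* Return 1 if both byte strings decode to the same character sequence,
 * regardless of how many redundant shifts each one contains. */
int iso2022jp_same_text(const unsigned char *a, size_t alen,
                        const unsigned char *b, size_t blen)
{
    struct cursor ca = { a, a + alen, CS_ASCII };
    struct cursor cb = { b, b + blen, CS_ASCII };
    for (;;) {
        enum charset csa, csb;
        unsigned xa, xb;
        int ra = next_char(&ca, &csa, &xa);
        int rb = next_char(&cb, &csb, &xb);
        if (ra < 0 || rb < 0 || ra != rb) return 0;
        if (ra == 0) return 1;          /* both ended together */
        if (csa != csb || xa != xb) return 0;
    }
}
```

With this sketch, a 10-byte string that shifts once around two kanji and a 16-byte string that shifts around each kanji separately compare as the same text, even though their checksums differ.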

In particular musl's to-ISO-2022-JP converter is stateless and always
generates shifts in/out around every non-ASCII character. Of course
this is highly suboptimal, but in the worst case (where the caller
calls iconv one character at a time) the iconv API can't do any better
because strings are required to end in the unshifted state, and the
iconv API doesn't have any method to "finalize" a conversion. This
implies that every time iconv returns with non-ASCII as the most
recent output character, it must be followed by a shift back to the
initial (ASCII) state.
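
The stateless behaviour described above can be sketched roughly as follows. This is not musl's actual converter: encode_stateless is a hypothetical name, and the input model (an array whose elements are either ASCII bytes or two-byte JIS X 0208 codes) is an assumption made to keep the example small.

```c
#include <stddef.h>

/* Hypothetical input model: each element is either an ASCII byte
 * (< 0x80) or a two-byte JIS X 0208 code such as 0x467c.
 * The caller must provide room for up to 8 * n output bytes. */
size_t encode_stateless(const unsigned short *in, size_t n,
                        unsigned char *out)
{
    size_t o = 0;
    for (size_t i = 0; i < n; i++) {
        if (in[i] < 0x80) {
            /* ASCII passes through unchanged. */
            out[o++] = (unsigned char)in[i];
            continue;
        }
        /* Shift to JIS X 0208: ESC $ B */
        out[o++] = 0x1b; out[o++] = '$'; out[o++] = 'B';
        /* The two-byte character itself. */
        out[o++] = (unsigned char)(in[i] >> 8);
        out[o++] = (unsigned char)(in[i] & 0xff);
        /* Shift back to the initial state: ESC ( B */
        out[o++] = 0x1b; out[o++] = '('; out[o++] = 'B';
    }
    return o;
}
```

Under this model every non-ASCII character costs 8 bytes for 2 bytes of payload, so three consecutive kanji take 24 bytes where a single shift pair would need 12. That kind of overhead would be consistent with the larger output size in the report quoted above.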

We could improve this in the case of batch conversions by overwriting
the previous shift-back-to-initial and skipping the next shift if the
character set of the next character to output matches the previous
one, but that only works within a single batch call, since iconv can't
write outside the buffer passed to it for the current call. This is an
improvement I think I want to make, since it would improve typical
output size a lot, but the cost is output determinism under different
chunking by the caller.
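
A rough sketch of that proposed improvement, and of the chunking dependence it introduces, might look like this. It is an illustration under assumptions, not the change as it would be made in musl: encode_call and encode_chunked are hypothetical names, the input model (ASCII bytes or two-byte JIS X 0208 codes in an array) is invented for the example, and each encode_call stands in for one call to iconv.

```c
#include <stddef.h>
#include <string.h>

static const unsigned char TO_JIS[3]   = { 0x1b, '$', 'B' };
static const unsigned char TO_ASCII[3] = { 0x1b, '(', 'B' };

/* One conversion call. It starts and ends in the ASCII state, but
 * merges adjacent shifts within the call. Needs up to 8 * n bytes. */
size_t encode_call(const unsigned short *in, size_t n, unsigned char *out)
{
    size_t o = 0;
    int prev_jis = 0;   /* previous character in this call was JIS X 0208 */
    for (size_t i = 0; i < n; i++) {
        if (in[i] < 0x80) {
            out[o++] = (unsigned char)in[i];
            prev_jis = 0;
            continue;
        }
        if (prev_jis) {
            /* Overwrite the shift-back written after the previous
             * character and skip the shift to JIS X 0208. */
            o -= sizeof TO_ASCII;
        } else {
            memcpy(out + o, TO_JIS, sizeof TO_JIS);
            o += sizeof TO_JIS;
        }
        out[o++] = (unsigned char)(in[i] >> 8);
        out[o++] = (unsigned char)(in[i] & 0xff);
        /* Always leave the output in the initial state, since this
         * may turn out to be the last character of the call. */
        memcpy(out + o, TO_ASCII, sizeof TO_ASCII);
        o += sizeof TO_ASCII;
        prev_jis = 1;
    }
    return o;
}

/* Feed the input to encode_call in pieces of `chunk` characters
 * (chunk must be nonzero) and return the total output size. */
size_t encode_chunked(const unsigned short *in, size_t n, size_t chunk,
                      unsigned char *out)
{
    size_t o = 0;
    for (size_t i = 0; i < n; i += chunk) {
        size_t len = n - i < chunk ? n - i : chunk;
        o += encode_call(in + i, len, out + o);
    }
    return o;
}
```

For three consecutive kanji, this sketch produces 12 bytes when they are passed in one call, 18 bytes when passed as chunks of two, and 24 bytes when passed one character at a time. All three outputs decode to the same text, but they differ bitwise, which is the loss of determinism mentioned above.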

Rich

Thread overview: 5+ messages
2018-02-27 16:57 Steffen Nurpmeso
2018-02-27 17:34 ` Rich Felker [this message]
2018-02-27 19:44 ` Steffen Nurpmeso
2018-02-27 20:19   ` Rich Felker
2018-02-27 21:28     ` Steffen Nurpmeso
