* iconv failure (ISO-2022-JP) since musl update on AlpineLinux
@ 2018-02-27 16:57 Steffen Nurpmeso
2018-02-27 17:34 ` Rich Felker
2018-02-27 19:44 ` Steffen Nurpmeso
0 siblings, 2 replies; 5+ messages in thread
From: Steffen Nurpmeso @ 2018-02-27 16:57 UTC (permalink / raw)
To: musl; +Cc: Steffen Nurpmeso
[-- Attachment #1: Type: text/plain, Size: 888 bytes --]
Hello.
After updating to musl-1.1.19-r0 there i saw test failures for the
MUA i maintain, namely regarding the mentioned charset. I will
attach a file to reproduce. (Am not subscribed.)
Ciao!
#?0[steffen@devon steffen]$ cksum in.utf
1259742080 686 in.utf
#?0[steffen@devon steffen]$ iconv -f utf8 -t iso-2022-jp <in.utf|cksum
2184132317 536
#?0[steffen@devon steffen]$ iconv --version
iconv (GNU libiconv 1.11)
..
#?0[steffen@essex tmp]$ cksum in.utf
1259742080 686 in.utf
#?0[steffen@essex tmp]$ iconv -f utf8 -t iso-2022-jp <in.utf|cksum
209789743 1736
#?0[steffen@essex tmp]$ apk info --who-owns /usr/bin/iconv
/usr/bin/iconv is owned by musl-utils-1.1.19-r0
--steffen
|
|Der Kragenbaer, The moon bear,
|der holt sich munter he cheerfully and one by one
|einen nach dem anderen runter wa.ks himself off
|(By Robert Gernhardt)
[-- Attachment #2: in.utf --]
[-- Type: text/plain, Size: 694 bytes --]
シジュウカラ科(シジュウカラか、学名 Paridae)は、鳥類スズメ目の科である。シジュウカラ(四十雀)と総称されるが、狭義にはこの1種をシジュウカラと呼ぶ。
カンムリガラ(学名Parus cristatus)は、スズメ目シジュウカラ科に分類される鳥類の一種。
カンムリガラ(学名Parus cristatus)は、スズメ目シジュウカラ科に分類される鳥類の一種。
シジュウカラ科(シジュウカラか、学名 Paridae)は、鳥類スズメ目の科である。シジュウカラ(四十雀)と総称されるが、狭義にはこの1種をシジュウカラと呼ぶ。
* Re: iconv failure (ISO-2022-JP) since musl update on AlpineLinux
From: Rich Felker @ 2018-02-27 17:34 UTC (permalink / raw)
To: musl
On Tue, Feb 27, 2018 at 05:57:04PM +0100, Steffen Nurpmeso wrote:
> Hello.
>
> After updating to musl-1.1.19-r0 there i saw test failures for the
> MUA i maintain, namely regarding the mentioned charset. I will
> attach a file to reproduce. (Am not subscribed.)
> Ciao!
>
> #?0[steffen@devon steffen]$ cksum in.utf
> 1259742080 686 in.utf
> #?0[steffen@devon steffen]$ iconv -f utf8 -t iso-2022-jp <in.utf|cksum
> 2184132317 536
> #?0[steffen@devon steffen]$ iconv --version
> iconv (GNU libiconv 1.11)
> ...
> #?0[steffen@essex tmp]$ cksum in.utf
> 1259742080 686 in.utf
> #?0[steffen@essex tmp]$ iconv -f utf8 -t iso-2022-jp <in.utf|cksum
> 209789743 1736
> #?0[steffen@essex tmp]$ apk info --who-owns /usr/bin/iconv
> /usr/bin/iconv is owned by musl-utils-1.1.19-r0
Does the data round-trip correctly? I don't think you can expect
bitwise match between outputs of different ISO-2022-JP converters,
unless perhaps they both guarantee minimality, because the ISO-2022-JP
representation of a string is highly nonunique.
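A round-trip check of the kind asked about here could be sketched as follows. This is only an illustration under stated assumptions: the helper names `convert_once` and `roundtrips` are hypothetical, and the buffer sizes (8 output bytes per input byte going out, 4 coming back) are generous guesses rather than documented bounds, so E2BIG is simply treated as a failure.

```c
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: one-shot conversion into a fresh buffer of `cap`
 * bytes. Returns NULL on any failure, including running out of space. */
static char *convert_once(const char *to, const char *from,
                          const char *in, size_t inlen,
                          size_t cap, size_t *outlen)
{
    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return NULL;

    char *buf = malloc(cap);
    if (buf == NULL) {
        iconv_close(cd);
        return NULL;
    }

    char *inp = (char *)in, *outp = buf;
    size_t inleft = inlen, outleft = cap;
    size_t r = iconv(cd, &inp, &inleft, &outp, &outleft);
    if (r != (size_t)-1)
        /* NULL input asks for the return-to-initial-state sequence, if any */
        r = iconv(cd, NULL, NULL, &outp, &outleft);
    iconv_close(cd);

    if (r == (size_t)-1) {
        free(buf);
        return NULL;
    }
    *outlen = cap - outleft;
    return buf;
}

/* Returns 1 if UTF-8 -> charset -> UTF-8 reproduces the input bytes,
 * 0 if the bytes differ, -1 if either conversion failed. */
int roundtrips(const char *charset, const char *utf8, size_t len)
{
    size_t mid_len = 0, back_len = 0;

    char *mid = convert_once(charset, "UTF-8", utf8, len,
                             8 * len + 16, &mid_len);
    if (mid == NULL)
        return -1;

    char *back = convert_once("UTF-8", charset, mid, mid_len,
                              4 * mid_len + 16, &back_len);
    free(mid);
    if (back == NULL)
        return -1;

    int same = (back_len == len) && memcmp(back, utf8, len) == 0;
    free(back);
    return same;
}
```

A test would then call something like `roundtrips("ISO-2022-JP", data, size)` and compare only the result, instead of checksumming the intermediate bytes, which may legitimately differ between implementations.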
In particular musl's to-ISO-2022-JP converter is stateless and always
generates shifts in/out around every non-ASCII character. Of course
this is highly suboptimal, but in the worst case (where the caller
calls iconv one character at a time) the iconv API can't do any better
because strings are required to end in the unshifted state, and the
iconv API doesn't have any method to "finalize" a conversion. This
implies that every time iconv returns with non-ASCII as the most
recent output character, it must be followed by a shift back to the
initial (ASCII) state.
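To make the size difference concrete, here is a toy sketch, not musl's actual code, of the two strategies. It assumes the input has already been mapped to either an ASCII byte or a two-byte JIS X 0208 code; the `struct jchar` type and the function names are hypothetical. The escape sequences ESC $ B (switch to JIS X 0208) and ESC ( B (switch to ASCII) are the ones ISO-2022-JP uses. The caller is assumed to supply an output buffer of at least 8 bytes per character.

```c
#include <stddef.h>

/* Hypothetical pre-mapped character: ASCII byte or JIS X 0208 pair. */
struct jchar {
    int is_jis;             /* 0: ASCII (uses b[0]), 1: JIS X 0208 */
    unsigned char b[2];
};

static const unsigned char TO_JIS[3]   = { 0x1B, '$', 'B' };  /* ESC $ B */
static const unsigned char TO_ASCII[3] = { 0x1B, '(', 'B' };  /* ESC ( B */

static size_t put(unsigned char *out, size_t pos,
                  const unsigned char *s, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[pos + i] = s[i];
    return pos + n;
}

/* Stateless: each JIS character carries its own pair of escapes, so the
 * output is back in the ASCII state after every single character. */
size_t encode_stateless(const struct jchar *in, size_t n, unsigned char *out)
{
    size_t pos = 0;
    for (size_t i = 0; i < n; i++) {
        if (in[i].is_jis) {
            pos = put(out, pos, TO_JIS, 3);
            pos = put(out, pos, in[i].b, 2);
            pos = put(out, pos, TO_ASCII, 3);
        } else {
            pos = put(out, pos, in[i].b, 1);
        }
    }
    return pos;
}

/* Minimal: switch only when the character set changes, and return to
 * ASCII once at the very end. */
size_t encode_minimal(const struct jchar *in, size_t n, unsigned char *out)
{
    size_t pos = 0;
    int in_jis = 0;
    for (size_t i = 0; i < n; i++) {
        int want_jis = in[i].is_jis != 0;
        if (want_jis != in_jis) {
            pos = put(out, pos, want_jis ? TO_JIS : TO_ASCII, 3);
            in_jis = want_jis;
        }
        pos = put(out, pos, in[i].b, want_jis ? 2 : 1);
    }
    if (in_jis)
        pos = put(out, pos, TO_ASCII, 3);
    return pos;
}
```

For a run of three JIS characters the stateless form takes 3 * 8 = 24 bytes and the minimal form 3 + 6 + 3 = 12 bytes. Both decode to the same text, which is why a bitwise comparison between converters is not meaningful.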
We could improve this in the case of batch conversions by overwriting
the previous shift-back-to-initial and skipping the next shift if the
character set of the next character to output matches the previous
one, but that only works within a single batch call, since iconv can't
write outside the buffer passed to it for the current call. This is an
improvement I think I want to make, since it would improve typical
output size a lot, but the cost is output determinism under different
chunking by the caller.
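The trade-off described above can be illustrated with a small sketch. It is not the musl implementation; `struct jchar`, `convert_call` and `convert_chunked` are hypothetical names, and the input is assumed to be pre-mapped to ASCII bytes or JIS X 0208 pairs. Within one call the previous shift-back is overwritten when the next character is again JIS, but nothing is remembered between calls. The output buffer is assumed to hold at least 8 bytes per character, and `chunk` must be greater than zero.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical pre-mapped character: ASCII byte or JIS X 0208 pair. */
struct jchar {
    int is_jis;             /* 0: ASCII (uses b[0]), 1: JIS X 0208 */
    unsigned char b[2];
};

static const unsigned char TO_JIS[3]   = { 0x1B, '$', 'B' };  /* ESC $ B */
static const unsigned char TO_ASCII[3] = { 0x1B, '(', 'B' };  /* ESC ( B */

/* One simulated conversion call. Every JIS character is followed by a
 * shift back to ASCII, so the call always ends in the initial state;
 * a following JIS character in the same call backs up over that shift. */
size_t convert_call(const struct jchar *in, size_t n, unsigned char *out)
{
    size_t pos = 0;
    int prev_jis = 0;       /* last thing written was JIS + shift-back */

    for (size_t i = 0; i < n; i++) {
        if (!in[i].is_jis) {
            out[pos++] = in[i].b[0];
            prev_jis = 0;
            continue;
        }
        if (prev_jis) {
            pos -= 3;                       /* overwrite previous ESC ( B */
        } else {
            memcpy(out + pos, TO_JIS, 3);   /* shift into JIS X 0208 */
            pos += 3;
        }
        memcpy(out + pos, in[i].b, 2);
        pos += 2;
        memcpy(out + pos, TO_ASCII, 3);     /* always end in ASCII */
        pos += 3;
        prev_jis = 1;
    }
    return pos;
}

/* Feed the same input in pieces of `chunk` characters per call. */
size_t convert_chunked(const struct jchar *in, size_t n, size_t chunk,
                       unsigned char *out)
{
    size_t pos = 0;
    for (size_t i = 0; i < n; i += chunk) {
        size_t len = (n - i < chunk) ? n - i : chunk;
        pos += convert_call(in + i, len, out + pos);
    }
    return pos;
}
```

With four consecutive JIS characters, one call of four produces 14 bytes, two calls of two produce 20 bytes, and four calls of one produce 32 bytes. All three decode identically, but the byte stream now depends on how the caller chunks the input.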
Rich
* Re: iconv failure (ISO-2022-JP) since musl update on AlpineLinux
From: Steffen Nurpmeso @ 2018-02-27 19:44 UTC (permalink / raw)
To: musl
Hi.
Rich Felker wrote:
sorry i did not get this :)
but i wrote:
||After updating to musl-1.1.19-r0 there i saw test failures for the
||MUA i maintain, namely regarding the mentioned charset. I will
||attach a file to reproduce. (Am not subscribed.)
...
|| #?0[steffen@devon steffen]$ cksum in.utf
|| 1259742080 686 in.utf
|| #?0[steffen@devon steffen]$ iconv -f utf8 -t iso-2022-jp <in.utf|cksum
|| 2184132317 536
|| #?0[steffen@devon steffen]$ iconv --version
|| iconv (GNU libiconv 1.11)
||..
|| #?0[steffen@essex tmp]$ cksum in.utf
|| 1259742080 686 in.utf
|| #?0[steffen@essex tmp]$ iconv -f utf8 -t iso-2022-jp <in.utf|cksum
|| 209789743 1736
|| #?0[steffen@essex tmp]$ apk info --who-owns /usr/bin/iconv
|| /usr/bin/iconv is owned by musl-utils-1.1.19-r0
|Does the data round-trip correctly? I don't think you can expect
Ok, i see what you mean, yes, musl iconv(1) can roundtrip. But..
But for one the error is new (though i actually have forgotten
whether the test ever ran on a musl box or only on BSD and glibc
Linux boxes, but if i recall, it did run, and then it did succeed,
definitely), and then...
|bitwise match between outputs of different ISO-2022-JP converters,
|unless perhaps they both guarantee minimality, because the ISO-2022-JP
|representation of a string is highly nonunique.
|
|In particular musl's to-ISO-2022-JP converter is stateless and always
|generates shifts in/out around every non-ASCII character. Of course
|this is highly suboptimal, but in the worst case (where the caller
|calls iconv one character at a time) the iconv API can't do any better
|because strings are required to end in the unshifted state, and the
|iconv API doesn't have any method to "finalize" a conversion. This
|implies that every time iconv returns with non-ASCII as the most
|recent output character, it must be followed by a shift back to the
|initial (ASCII) state.
|
|We could improve this in the case of batch conversions by overwriting
|the previous shift-back-to-initial and skipping the next shift if the
|character set of the next character to output matches the previous
|one, but that only works within a single batch call, since iconv can't
|write outside the buffer passed to it for the current call. This is an
|improvement I think I want to make, since it would improve typical
|output size a lot, but the cost is output determinism under different
|chunking by the caller.
Well... In my cases the MUA fails to convert to ISO-2022-JP at
all, because an iconv(3) error happens. And when i instrument my
code like
  for(;;){
    size_t sz;
    fprintf(stderr, "iconv(3): in %lu out: %lu\n",*inbleft,*outbleft);
    fprintf(stderr, " in<%.*s>\n",(int)*inbleft,*inb);
    sz = iconv(cd, __INBCAST(inb), inbleft, outb, outbleft);
    if(sz > 0 && !(icf & n_ICONV_IGN_NOREVERSE)){
      fprintf(stderr, "iconv(3) returned 0x%lX: %s\n",(ul_i)sz,strerror(errno));
      err = n_ERR_NOENT;
      goto jleave;
    }
    if(sz != (size_t)-1)
      break;
then i get
#?1[steffen@essex nail.git]$ v mae-test-behave_iconv_mbyte_base64-2
ICONV 2
iconv(3): in 220 out: 427
in<シジュウカラ科(シジュウカラか、学名 Paridae)は、鳥類スズメ目の科である。シジュウカラ(四十雀)と総称されるが、狭義にはこの1種をシジュウカラと呼ぶ。
>
iconv(3) returned 0xFFFFFFFFFFFFFFFF: Argument list too long
ICONV 2 err: 2
And that is somehow ooops? Interestingly if i call iconv(1) only
on these 220 bytes i can roundtrip that, too. Hmmm. ...
I thought maybe it is because of the tcc(1) compiler i use, but
i can reproduce this with AlpineLinux gcc(1), too. I don't know.
Ciao,
--steffen
* Re: Re: iconv failure (ISO-2022-JP) since musl update on AlpineLinux
From: Rich Felker @ 2018-02-27 20:19 UTC (permalink / raw)
To: musl, Steffen Nurpmeso
On Tue, Feb 27, 2018 at 08:44:32PM +0100, Steffen Nurpmeso wrote:
> Hi.
>
> Rich Felker wrote:
>
> sorry i did not get this :)
Sorry I neglected to keep you CC'd.
> but i wrote:
> ||After updating to musl-1.1.19-r0 there i saw test failures for the
> ||MUA i maintain, namely regarding the mentioned charset. I will
> ||attach a file to reproduce. (Am not subscribed.)
> ...
> || #?0[steffen@devon steffen]$ cksum in.utf
> || 1259742080 686 in.utf
> || #?0[steffen@devon steffen]$ iconv -f utf8 -t iso-2022-jp <in.utf|cksum
> || 2184132317 536
> || #?0[steffen@devon steffen]$ iconv --version
> || iconv (GNU libiconv 1.11)
> ||..
> || #?0[steffen@essex tmp]$ cksum in.utf
> || 1259742080 686 in.utf
> || #?0[steffen@essex tmp]$ iconv -f utf8 -t iso-2022-jp <in.utf|cksum
> || 209789743 1736
> || #?0[steffen@essex tmp]$ apk info --who-owns /usr/bin/iconv
> || /usr/bin/iconv is owned by musl-utils-1.1.19-r0
>
> |Does the data round-trip correctly? I don't think you can expect
>
> Ok, i see what you mean, yes, musl iconv(1) can roundtrip. But..
> But for one the error is new (though i actually have forgotten
> whether the test ever ran on a musl box or only on BSD and glibc
> Linux boxes, but if i recall, it did run, and then it did succeed,
> definitely), and then...
>
> |bitwise match between outputs of different ISO-2022-JP converters,
> |unless perhaps they both guarantee minimality, because the ISO-2022-JP
> |representation of a string is highly nonunique.
> |
> |In particular musl's to-ISO-2022-JP converter is stateless and always
> |generates shifts in/out around every non-ASCII character. Of course
> |this is highly suboptimal, but in the worst case (where the caller
> |calls iconv one character at a time) the iconv API can't do any better
> |because strings are required to end in the unshifted state, and the
> |iconv API doesn't have any method to "finalize" a conversion. This
> |implies that every time iconv returns with non-ASCII as the most
> |recent output character, it must be followed by a shift back to the
> |initial (ASCII) state.
> |
> |We could improve this in the case of batch conversions by overwriting
> |the previous shift-back-to-initial and skipping the next shift if the
> |character set of the next character to output matches the previous
> |one, but that only works within a single batch call, since iconv can't
> |write outside the buffer passed to it for the current call. This is an
> |improvement I think I want to make, since it would improve typical
> |output size a lot, but the cost is output determinism under different
> |chunking by the caller.
>
> Well... In my cases the MUA fails to convert to ISO-2022-JP at
> all, because an iconv(3) error happens. And when i instrument my
> code like
>
> for(;;){
> size_t sz;
>
> fprintf(stderr, "iconv(3): in %lu out: %lu\n",*inbleft,*outbleft);
> fprintf(stderr, " in<%.*s>\n",(int)*inbleft,*inb);
> sz = iconv(cd, __INBCAST(inb), inbleft, outb, outbleft);
> if(sz > 0 && !(icf & n_ICONV_IGN_NOREVERSE)){
> fprintf(stderr, "iconv(3) returned 0x%lX: %s\n",(ul_i)sz,strerror(errno));
> err = n_ERR_NOENT;
> goto jleave;
> }
> if(sz != (size_t)-1)
> break;
>
> then i get
>
> #?1[steffen@essex nail.git]$ v mae-test-behave_iconv_mbyte_base64-2
> ICONV 2
> iconv(3): in 220 out: 427
> in<シジュウカラ科(シジュウカラか、学名 Paridae)は、鳥類スズメ目の科である。シジュウカラ(四十雀)と総称されるが、狭義にはこの1種をシジュウカラと呼ぶ。
> >
> iconv(3) returned 0xFFFFFFFFFFFFFFFF: Argument list too long
> ICONV 2 err: 2
>
> And that is somehow ooops? Interestingly if i call iconv(1) only
> on these 220 bytes i can roundtrip that, too. Hmmm. ...
> I thought maybe it is because of the tcc(1) compiler i use, but
> i can reproduce this with AlpineLinux gcc(1), too. I don't know.
I think the test is just using an output buffer that's under the
worst-case size needed for conversion to ISO-2022-JP. The E2BIG error
is specified for "Input conversion stopped due to lack of space in the
output buffer" and is not really an error; it just means the
conversion stopped before reaching the end and you need to resume with
a new buffer for the remainder of the conversion.
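A resume-on-E2BIG loop along these lines might look like the sketch below. It is a minimal illustration, not the MUA's wrapper: the function name `convert_all` and its `initial_cap` parameter are hypothetical, and it grows one buffer with `realloc` rather than handing out a new one. The final call with a NULL input is the POSIX way to request the return-to-initial-state sequence for stateful encodings; it should be harmless for a converter that already ends every call in the initial state.

```c
#include <errno.h>
#include <iconv.h>
#include <stdlib.h>

/* Convert in[0..inlen) with descriptor cd. Returns a malloc'd buffer and
 * stores its length in *outlen, or returns NULL on a real error (EILSEQ,
 * EINVAL, out of memory). E2BIG is handled by growing the buffer. */
char *convert_all(iconv_t cd, const char *in, size_t inlen,
                  size_t initial_cap, size_t *outlen)
{
    size_t cap = initial_cap ? initial_cap : 64;
    size_t used = 0, inleft = inlen;
    char *inp = (char *)in;     /* iconv() wants char **, input is not modified */
    char *buf = malloc(cap);
    int flushing = 0;

    if (buf == NULL)
        return NULL;

    for (;;) {
        char *outp = buf + used;
        size_t outleft = cap - used;
        size_t r = flushing
            ? iconv(cd, NULL, NULL, &outp, &outleft)
            : iconv(cd, &inp, &inleft, &outp, &outleft);

        used = cap - outleft;   /* keep what was written, even on E2BIG */

        if (r != (size_t)-1) {
            if (flushing)
                break;          /* input consumed and final state written */
            flushing = 1;       /* next pass: emit any reset sequence */
            continue;
        }
        if (errno != E2BIG) {   /* anything else is a genuine failure */
            free(buf);
            return NULL;
        }

        /* E2BIG: output space ran out; enlarge and resume where we stopped */
        char *bigger = realloc(buf, cap * 2);
        if (bigger == NULL) {
            free(buf);
            return NULL;
        }
        buf = bigger;
        cap *= 2;
    }

    *outlen = used;
    return buf;
}
```

Note that the return value is compared against `(size_t)-1` before anything else; a test such as `sz > 0` is also true for `(size_t)-1`, so it cannot distinguish an E2BIG return from a count of non-reversible conversions.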
Rich
* Re: Re: iconv failure (ISO-2022-JP) since musl update on AlpineLinux
From: Steffen Nurpmeso @ 2018-02-27 21:28 UTC (permalink / raw)
To: musl
Hello Rich Felker.
Rich Felker <dalias@libc.org> wrote:
|On Tue, Feb 27, 2018 at 08:44:32PM +0100, Steffen Nurpmeso wrote:
|> Rich Felker wrote:
...
|> but i wrote:
|>||After updating to musl-1.1.19-r0 there i saw test failures for the
|>||MUA i maintain, namely regarding the mentioned charset. I will
..
|>|Does the data round-trip correctly? I don't think you can expect
|>
|> Ok, i see what you mean, yes, musl iconv(1) can roundtrip. But..
...
|> Well... In my cases the MUA fails to convert to ISO-2022-JP at
|> all, because an iconv(3) error happens. And when i instrument my
..
|I think the test is just using an output buffer that's under the
|worst-case size needed for conversion to ISO-2022-JP. The E2BIG error
|is specified for "Input conversion stopped due to lack of space in the
|output buffer" and is not really an error; it just means the
|conversion stopped before reaching the end and you need to resume with
|a new buffer for the remainder of the conversion.
That iconv(3) wrapper i had hacked into that MUA in 2014 was
indeed complete nonsense and entirely false. Now corrected.
Thanks for answering the brain damage.
And i will adjust the tests to checksum only the headers and the
roundtrip output of the body content, thanks for pointing this
out. Be aware you have been credited.
Ciao,
--steffen
|