* iconv failure (ISO-2022-JP) since musl update on AlpineLinux
@ 2018-02-27 16:57 Steffen Nurpmeso
  2018-02-27 17:34 ` Rich Felker
  2018-02-27 19:44 ` Steffen Nurpmeso
  0 siblings, 2 replies; 5+ messages in thread
From: Steffen Nurpmeso @ 2018-02-27 16:57 UTC
  To: musl; +Cc: Steffen Nurpmeso

[-- Attachment #1: Type: text/plain, Size: 888 bytes --]

Hello.

After updating to musl-1.1.19-r0 there i saw test failures for the
MUA i maintain, namely regarding the mentioned charset. I will
attach a file to reproduce. (Am not subscribed.)
Ciao!

  #?0[steffen@devon steffen]$ cksum in.utf
  1259742080 686 in.utf
  #?0[steffen@devon steffen]$ iconv -f utf8 -t iso-2022-jp <in.utf|cksum
  2184132317 536
  #?0[steffen@devon steffen]$ iconv --version
  iconv (GNU libiconv 1.11)
  ..
  #?0[steffen@essex tmp]$ cksum in.utf
  1259742080 686 in.utf
  #?0[steffen@essex tmp]$ iconv -f utf8 -t iso-2022-jp <in.utf|cksum
  209789743 1736
  #?0[steffen@essex tmp]$ apk info --who-owns /usr/bin/iconv
  /usr/bin/iconv is owned by musl-utils-1.1.19-r0

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)

[-- Attachment #2: in.utf --]
[-- Type: text/plain, Size: 694 bytes --]

シジュウカラ科(シジュウカラか、学名 Paridae)は、鳥類スズメ目の科である。シジュウカラ(四十雀)と総称されるが、狭義にはこの1種をシジュウカラと呼ぶ。
カンムリガラ(学名Parus cristatus)は、スズメ目シジュウカラ科に分類される鳥類の一種。
カンムリガラ(学名Parus cristatus)は、スズメ目シジュウカラ科に分類される鳥類の一種。
シジュウカラ科(シジュウカラか、学名 Paridae)は、鳥類スズメ目の科である。シジュウカラ(四十雀)と総称されるが、狭義にはこの1種をシジュウカラと呼ぶ。

* Re: iconv failure (ISO-2022-JP) since musl update on AlpineLinux
  2018-02-27 16:57 iconv failure (ISO-2022-JP) since musl update on AlpineLinux Steffen Nurpmeso
@ 2018-02-27 17:34 ` Rich Felker
  2018-02-27 19:44 ` Steffen Nurpmeso
  1 sibling, 0 replies; 5+ messages in thread
From: Rich Felker @ 2018-02-27 17:34 UTC
  To: musl

On Tue, Feb 27, 2018 at 05:57:04PM +0100, Steffen Nurpmeso wrote:
> Hello.
>
> After updating to musl-1.1.19-r0 there i saw test failures for the
> MUA i maintain, namely regarding the mentioned charset. I will
> attach a file to reproduce. (Am not subscribed.)
> Ciao!
>
>   #?0[steffen@devon steffen]$ cksum in.utf
>   1259742080 686 in.utf
>   #?0[steffen@devon steffen]$ iconv -f utf8 -t iso-2022-jp <in.utf|cksum
>   2184132317 536
>   #?0[steffen@devon steffen]$ iconv --version
>   iconv (GNU libiconv 1.11)
>   ...
>   #?0[steffen@essex tmp]$ cksum in.utf
>   1259742080 686 in.utf
>   #?0[steffen@essex tmp]$ iconv -f utf8 -t iso-2022-jp <in.utf|cksum
>   209789743 1736
>   #?0[steffen@essex tmp]$ apk info --who-owns /usr/bin/iconv
>   /usr/bin/iconv is owned by musl-utils-1.1.19-r0

Does the data round-trip correctly? I don't think you can expect
bitwise match between outputs of different ISO-2022-JP converters,
unless perhaps they both guarantee minimality, because the ISO-2022-JP
representation of a string is highly nonunique.

In particular musl's to-ISO-2022-JP converter is stateless and always
generates shifts in/out around every non-ASCII character. Of course
this is highly suboptimal, but in the worst case (where the caller
calls iconv one character at a time) the iconv API can't do any better
because strings are required to end in the unshifted state, and the
iconv API doesn't have any method to "finalize" a conversion. This
implies that every time iconv returns with non-ASCII as the most
recent output character, it must be followed by a shift back to the
initial (ASCII) state.

We could improve this in the case of batch conversions by overwriting
the previous shift-back-to-initial and skipping the next shift if the
character set of the next character to output matches the previous
one, but that only works within a single batch call, since iconv can't
write outside the buffer passed to it for the current call. This is an
improvement I think I want to make, since it would improve typical
output size a lot, but the cost is output determinism under different
chunking by the caller.

Rich
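
[Editorial sketch, not part of the original message.] The size difference described above can be illustrated with a small C sketch. It assumes the two escape sequences defined for ISO-2022-JP in RFC 1468 (`ESC $ B` to select JIS X 0208-1983, `ESC ( B` to return to ASCII). The function names `emit_stateless`, `stateless_size` and `minimal_size` are hypothetical and are not taken from musl; this is only meant to show the arithmetic, not how musl implements the converter.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Escape sequences from RFC 1468 (ISO-2022-JP). */
static const unsigned char ESC_JISX0208[3] = {0x1b, '$', 'B'}; /* select JIS X 0208-1983 */
static const unsigned char ESC_ASCII[3]    = {0x1b, '(', 'B'}; /* back to ASCII */

/*
 * Hypothetical illustration (not musl's code): write one double-byte
 * character wrapped in its own shift pair. `out` needs room for 8 bytes.
 * Returns the number of bytes written.
 */
size_t emit_stateless(unsigned char *out, unsigned char b1, unsigned char b2)
{
    assert(out != NULL);
    memcpy(out, ESC_JISX0208, 3);
    out[3] = b1;
    out[4] = b2;
    memcpy(out + 5, ESC_ASCII, 3);
    return 8;
}

/* Output bytes for a run of n double-byte characters, one shift pair each. */
size_t stateless_size(size_t n)
{
    return 8 * n;
}

/* Output bytes for the same run with a single shift pair around the run. */
size_t minimal_size(size_t n)
{
    return n ? 3 + 2 * n + 3 : 0;
}
```

For a run of 10 such characters this gives 80 bytes against 26, so a stateless encoder can approach four times the minimal size on long non-ASCII runs. That is in the same range as the 1736-byte versus 536-byte outputs reported at the top of the thread, although the exact figures depend on the mix of ASCII and non-ASCII characters in the input.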

* Re: iconv failure (ISO-2022-JP) since musl update on AlpineLinux
  2018-02-27 16:57 iconv failure (ISO-2022-JP) since musl update on AlpineLinux Steffen Nurpmeso
  2018-02-27 17:34 ` Rich Felker
@ 2018-02-27 19:44 ` Steffen Nurpmeso
  2018-02-27 20:19   ` Rich Felker
  1 sibling, 1 reply; 5+ messages in thread
From: Steffen Nurpmeso @ 2018-02-27 19:44 UTC
  To: musl

Hi.

Rich Felker wrote:

sorry i did not get this :)
but i wrote:
 ||After updating to musl-1.1.19-r0 there i saw test failures for the
 ||MUA i maintain, namely regarding the mentioned charset. I will
 ||attach a file to reproduce. (Am not subscribed.)
 ...
 ||  #?0[steffen@devon steffen]$ cksum in.utf
 ||  1259742080 686 in.utf
 ||  #?0[steffen@devon steffen]$ iconv -f utf8 -t iso-2022-jp <in.utf|cksum
 ||  2184132317 536
 ||  #?0[steffen@devon steffen]$ iconv --version
 ||  iconv (GNU libiconv 1.11)
 ||..
 ||  #?0[steffen@essex tmp]$ cksum in.utf
 ||  1259742080 686 in.utf
 ||  #?0[steffen@essex tmp]$ iconv -f utf8 -t iso-2022-jp <in.utf|cksum
 ||  209789743 1736
 ||  #?0[steffen@essex tmp]$ apk info --who-owns /usr/bin/iconv
 ||  /usr/bin/iconv is owned by musl-utils-1.1.19-r0

 |Does the data round-trip correctly? I don't think you can expect

Ok, i see what you mean, yes, musl iconv(1) can roundtrip. But..
But for one the error is new (though i actually have forgotten
whether the test ever ran on a musl box or only on BSD and glibc
Linux boxes, but if i recall, it did run, and then it did succeed,
definitely), and then...

 |bitwise match between outputs of different ISO-2022-JP converters,
 |unless perhaps they both guarantee minimality, because the ISO-2022-JP
 |representation of a string is highly nonunique.
 |
 |In particular musl's to-ISO-2022-JP converter is stateless and always
 |generates shifts in/out around every non-ASCII character. Of course
 |this is highly suboptimal, but in the worst case (where the caller
 |calls iconv one character at a time) the iconv API can't do any better
 |because strings are required to end in the unshifted state, and the
 |iconv API doesn't have any method to "finalize" a conversion. This
 |implies that every time iconv returns with non-ASCII as the most
 |recent output character, it must be followed by a shift back to the
 |initial (ASCII) state.
 |
 |We could improve this in the case of batch conversions by overwriting
 |the previous shift-back-to-initial and skipping the next shift if the
 |character set of the next character to output matches the previous
 |one, but that only works within a single batch call, since iconv can't
 |write outside the buffer passed to it for the current call. This is an
 |improvement I think I want to make, since it would improve typical
 |output size a lot, but the cost is output determinism under different
 |chunking by the caller.

Well... In my cases the MUA fails to convert to ISO-2022-JP at
all, because an iconv(3) error happens. And when i instrument my
code like

   for(;;){
      size_t sz;

      fprintf(stderr, "iconv(3): in %lu out: %lu\n",*inbleft,*outbleft);
      fprintf(stderr, " in<%.*s>\n",(int)*inbleft,*inb);
      sz = iconv(cd, __INBCAST(inb), inbleft, outb, outbleft);
      if(sz > 0 && !(icf & n_ICONV_IGN_NOREVERSE)){
         fprintf(stderr, "iconv(3) returned 0x%lX: %s\n",(ul_i)sz,strerror(errno));
         err = n_ERR_NOENT;
         goto jleave;
      }
      if(sz != (size_t)-1)
         break;

then i get

  #?1[steffen@essex nail.git]$ v mae-test-behave_iconv_mbyte_base64-2
  ICONV 2
  iconv(3): in 220 out: 427
   in<シジュウカラ科(シジュウカラか、学名 Paridae)は、鳥類スズメ目の科である。シジュウカラ(四十雀)と総称されるが、狭義にはこの1種をシジュウカラと呼ぶ。
  >
  iconv(3) returned 0xFFFFFFFFFFFFFFFF: Argument list too long
  ICONV 2 err: 2

And that is somehow ooops? Interestingly if i call iconv(1) only
on these 220 bytes i can roundtrip that, too. Hmmm. ...
I thought maybe it is because of the tcc(1) compiler i use, but
i can reproduce this with AlpineLinux gcc(1), too. I don't know.

Ciao,

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)

* Re: Re: iconv failure (ISO-2022-JP) since musl update on AlpineLinux
  2018-02-27 19:44 ` Steffen Nurpmeso
@ 2018-02-27 20:19   ` Rich Felker
  2018-02-27 21:28     ` Steffen Nurpmeso
  0 siblings, 1 reply; 5+ messages in thread
From: Rich Felker @ 2018-02-27 20:19 UTC
  To: musl, Steffen Nurpmeso

On Tue, Feb 27, 2018 at 08:44:32PM +0100, Steffen Nurpmeso wrote:
> Hi.
>
> Rich Felker wrote:
>
> sorry i did not get this :)

Sorry I neglected to keep you CC'd.

> but i wrote:
>  ||After updating to musl-1.1.19-r0 there i saw test failures for the
>  ||MUA i maintain, namely regarding the mentioned charset. I will
>  ||attach a file to reproduce. (Am not subscribed.)
>  ...
>  ||  #?0[steffen@devon steffen]$ cksum in.utf
>  ||  1259742080 686 in.utf
>  ||  #?0[steffen@devon steffen]$ iconv -f utf8 -t iso-2022-jp <in.utf|cksum
>  ||  2184132317 536
>  ||  #?0[steffen@devon steffen]$ iconv --version
>  ||  iconv (GNU libiconv 1.11)
>  ||..
>  ||  #?0[steffen@essex tmp]$ cksum in.utf
>  ||  1259742080 686 in.utf
>  ||  #?0[steffen@essex tmp]$ iconv -f utf8 -t iso-2022-jp <in.utf|cksum
>  ||  209789743 1736
>  ||  #?0[steffen@essex tmp]$ apk info --who-owns /usr/bin/iconv
>  ||  /usr/bin/iconv is owned by musl-utils-1.1.19-r0
>
>  |Does the data round-trip correctly? I don't think you can expect
>
> Ok, i see what you mean, yes, musl iconv(1) can roundtrip. But..
> But for one the error is new (though i actually have forgotten
> whether the test ever ran on a musl box or only on BSD and glibc
> Linux boxes, but if i recall, it did run, and then it did succeed,
> definitely), and then...
>
>  |bitwise match between outputs of different ISO-2022-JP converters,
>  |unless perhaps they both guarantee minimality, because the ISO-2022-JP
>  |representation of a string is highly nonunique.
>  |
>  |In particular musl's to-ISO-2022-JP converter is stateless and always
>  |generates shifts in/out around every non-ASCII character. Of course
>  |this is highly suboptimal, but in the worst case (where the caller
>  |calls iconv one character at a time) the iconv API can't do any better
>  |because strings are required to end in the unshifted state, and the
>  |iconv API doesn't have any method to "finalize" a conversion. This
>  |implies that every time iconv returns with non-ASCII as the most
>  |recent output character, it must be followed by a shift back to the
>  |initial (ASCII) state.
>  |
>  |We could improve this in the case of batch conversions by overwriting
>  |the previous shift-back-to-initial and skipping the next shift if the
>  |character set of the next character to output matches the previous
>  |one, but that only works within a single batch call, since iconv can't
>  |write outside the buffer passed to it for the current call. This is an
>  |improvement I think I want to make, since it would improve typical
>  |output size a lot, but the cost is output determinism under different
>  |chunking by the caller.
>
> Well... In my cases the MUA fails to convert to ISO-2022-JP at
> all, because an iconv(3) error happens. And when i instrument my
> code like
>
>    for(;;){
>       size_t sz;
>
>       fprintf(stderr, "iconv(3): in %lu out: %lu\n",*inbleft,*outbleft);
>       fprintf(stderr, " in<%.*s>\n",(int)*inbleft,*inb);
>       sz = iconv(cd, __INBCAST(inb), inbleft, outb, outbleft);
>       if(sz > 0 && !(icf & n_ICONV_IGN_NOREVERSE)){
>          fprintf(stderr, "iconv(3) returned 0x%lX: %s\n",(ul_i)sz,strerror(errno));
>          err = n_ERR_NOENT;
>          goto jleave;
>       }
>       if(sz != (size_t)-1)
>          break;
>
> then i get
>
>   #?1[steffen@essex nail.git]$ v mae-test-behave_iconv_mbyte_base64-2
>   ICONV 2
>   iconv(3): in 220 out: 427
>    in<シジュウカラ科(シジュウカラか、学名 Paridae)は、鳥類スズメ目の科である。シジュウカラ(四十雀)と総称されるが、狭義にはこの1種をシジュウカラと呼ぶ。
>   >
>   iconv(3) returned 0xFFFFFFFFFFFFFFFF: Argument list too long
>   ICONV 2 err: 2
>
> And that is somehow ooops? Interestingly if i call iconv(1) only
> on these 220 bytes i can roundtrip that, too. Hmmm. ...
> I thought maybe it is because of the tcc(1) compiler i use, but
> i can reproduce this with AlpineLinux gcc(1), too. I don't know.

I think the test is just using an output buffer that's under the
worst-case size needed for conversion to ISO-2022-JP. The E2BIG error
is specified for "Input conversion stopped due to lack of space in the
output buffer" and is not really an error; it just means the
conversion stopped before reaching the end and you need to resume with
a new buffer for the remainder of the conversion.

Rich
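
[Editorial sketch, not part of the original message.] The resume-on-E2BIG pattern described above might look roughly like the following. This is a minimal sketch under stated assumptions: the helper name `convert_all`, the starting capacity of 16 bytes and the buffer-doubling policy are illustrative choices, not code from the MUA or from musl. Only `iconv_open`, `iconv` and `iconv_close` are the standard POSIX calls. Note that `size_t` is unsigned, so a test such as `sz > 0` is also true for `(size_t)-1`; the error return therefore has to be checked before any positive count is interpreted.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: convert `inlen` bytes at `in` from charset `from`
 * to charset `to`, growing the output buffer whenever iconv(3) reports
 * E2BIG. Returns a malloc'd buffer (length stored in *outlen), or NULL on
 * a real failure (unknown charset, EILSEQ, EINVAL, out of memory).
 */
char *convert_all(const char *to, const char *from,
                  const char *in, size_t inlen, size_t *outlen)
{
    assert(in != NULL && outlen != NULL);

    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return NULL;

    size_t cap = 16;         /* deliberately small so E2BIG is exercised */
    size_t used = 0;
    char *out = malloc(cap);
    char *inp = (char *)in;  /* iconv(3) takes char **; the input is not modified */
    int flushing = 0;        /* 0: converting input, 1: final reset call */

    while (out != NULL) {
        char *outp = out + used;
        size_t outleft = cap - used;
        /* The final call with a NULL input resets the shift state. */
        size_t r = flushing
            ? iconv(cd, NULL, NULL, &outp, &outleft)
            : iconv(cd, &inp, &inlen, &outp, &outleft);
        used = (size_t)(outp - out);

        if (r != (size_t)-1) {
            if (flushing)
                break;       /* input consumed and state reset: done */
            flushing = 1;
            continue;
        }
        if (errno != E2BIG) { /* a genuine conversion error */
            free(out);
            out = NULL;
            break;
        }
        /* E2BIG: keep what was converted so far, enlarge, then resume. */
        char *tmp = realloc(out, cap * 2);
        if (tmp == NULL) {
            free(out);
            out = NULL;
            break;
        }
        out = tmp;
        cap *= 2;
    }

    iconv_close(cd);
    if (out != NULL)
        *outlen = used;
    return out;
}
```

A caller would pass for example `convert_all("ISO-2022-JP", "UTF-8", buf, len, &n)` and `free()` the result. Growing the buffer on demand avoids having to guess a worst-case expansion factor up front, at the cost of a few extra `iconv` calls.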

* Re: Re: iconv failure (ISO-2022-JP) since musl update on AlpineLinux
  2018-02-27 20:19   ` Rich Felker
@ 2018-02-27 21:28     ` Steffen Nurpmeso
  0 siblings, 0 replies; 5+ messages in thread
From: Steffen Nurpmeso @ 2018-02-27 21:28 UTC
  To: musl

Hello Rich Felker.

Rich Felker <dalias@libc.org> wrote:
 |On Tue, Feb 27, 2018 at 08:44:32PM +0100, Steffen Nurpmeso wrote:
 |> Rich Felker wrote:
 ...
 |> but i wrote:
 |>||After updating to musl-1.1.19-r0 there i saw test failures for the
 |>||MUA i maintain, namely regarding the mentioned charset. I will
 ..
 |>|Does the data round-trip correctly? I don't think you can expect
 |>
 |> Ok, i see what you mean, yes, musl iconv(1) can roundtrip. But..
 ...
 |> Well... In my cases the MUA fails to convert to ISO-2022-JP at
 |> all, because an iconv(3) error happens. And when i instrument my
 ..
 |I think the test is just using an output buffer that's under the
 |worst-case size needed for conversion to ISO-2022-JP. The E2BIG error
 |is specified for "Input conversion stopped due to lack of space in the
 |output buffer" and is not really an error; it just means the
 |conversion stopped before reaching the end and you need to resume with
 |a new buffer for the remainder of the conversion.

That iconv(3) wrapper i had hacked into that MUA in 2014 was
indeed complete nonsense and entirely false. Now corrected.
Thanks for answering the brain damage. And i will adjust the
tests to checksum only the headers and the roundtrip output of the
body content, thanks for pointing this out.
Be aware you have been credited.

Ciao,

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)

Thread overview: 5+ messages
2018-02-27 16:57 iconv failure (ISO-2022-JP) since musl update on AlpineLinux Steffen Nurpmeso
2018-02-27 17:34 ` Rich Felker
2018-02-27 19:44 ` Steffen Nurpmeso
2018-02-27 20:19   ` Rich Felker
2018-02-27 21:28     ` Steffen Nurpmeso