mailing list of musl libc
* iconv failure (ISO-2022-JP) since musl update on AlpineLinux
@ 2018-02-27 16:57 Steffen Nurpmeso
  2018-02-27 17:34 ` Rich Felker
  2018-02-27 19:44 ` Steffen Nurpmeso
  0 siblings, 2 replies; 5+ messages in thread
From: Steffen Nurpmeso @ 2018-02-27 16:57 UTC
  To: musl; +Cc: Steffen Nurpmeso

[-- Attachment #1: Type: text/plain, Size: 888 bytes --]

Hello.

After updating to musl-1.1.19-r0 there i saw test failures for the
MUA i maintain, namely regarding the mentioned charset.  I will
attach a file to reproduce.  (Am not subscribed.)
Ciao!

  #?0[steffen@devon steffen]$ cksum in.utf 
  1259742080 686 in.utf
  #?0[steffen@devon steffen]$ iconv -f utf8 -t iso-2022-jp <in.utf|cksum
  2184132317 536
  #?0[steffen@devon steffen]$ iconv --version
  iconv (GNU libiconv 1.11)
..
  #?0[steffen@essex tmp]$ cksum in.utf 
  1259742080 686 in.utf
  #?0[steffen@essex tmp]$ iconv -f utf8 -t iso-2022-jp <in.utf|cksum 
  209789743 1736
  #?0[steffen@essex tmp]$ apk info --who-owns /usr/bin/iconv 
  /usr/bin/iconv is owned by musl-utils-1.1.19-r0

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)

[-- Attachment #2: in.utf --]
[-- Type: text/plain, Size: 694 bytes --]

シジュウカラ科(シジュウカラか、学名 Paridae)は、鳥類スズメ目の科である。シジュウカラ(四十雀)と総称されるが、狭義にはこの1種をシジュウカラと呼ぶ。

カンムリガラ(学名Parus cristatus)は、スズメ目シジュウカラ科に分類される鳥類の一種。


カンムリガラ(学名Parus cristatus)は、スズメ目シジュウカラ科に分類される鳥類の一種。

シジュウカラ科(シジュウカラか、学名 Paridae)は、鳥類スズメ目の科である。シジュウカラ(四十雀)と総称されるが、狭義にはこの1種をシジュウカラと呼ぶ。


* Re: iconv failure (ISO-2022-JP) since musl update on AlpineLinux
  2018-02-27 16:57 iconv failure (ISO-2022-JP) since musl update on AlpineLinux Steffen Nurpmeso
@ 2018-02-27 17:34 ` Rich Felker
  2018-02-27 19:44 ` Steffen Nurpmeso
  1 sibling, 0 replies; 5+ messages in thread
From: Rich Felker @ 2018-02-27 17:34 UTC
  To: musl

On Tue, Feb 27, 2018 at 05:57:04PM +0100, Steffen Nurpmeso wrote:
> Hello.
> 
> After updating to musl-1.1.19-r0 there i saw test failures for the
> MUA i maintain, namely regarding the mentioned charset.  I will
> attach a file to reproduce.  (Am not subscribed.)
> Ciao!
> 
>   #?0[steffen@devon steffen]$ cksum in.utf 
>   1259742080 686 in.utf
>   #?0[steffen@devon steffen]$ iconv -f utf8 -t iso-2022-jp <in.utf|cksum
>   2184132317 536
>   #?0[steffen@devon steffen]$ iconv --version
>   iconv (GNU libiconv 1.11)
> ...
>   #?0[steffen@essex tmp]$ cksum in.utf 
>   1259742080 686 in.utf
>   #?0[steffen@essex tmp]$ iconv -f utf8 -t iso-2022-jp <in.utf|cksum 
>   209789743 1736
>   #?0[steffen@essex tmp]$ apk info --who-owns /usr/bin/iconv 
>   /usr/bin/iconv is owned by musl-utils-1.1.19-r0

Does the data round-trip correctly? I don't think you can expect
bitwise match between outputs of different ISO-2022-JP converters,
unless perhaps they both guarantee minimality, because the ISO-2022-JP
representation of a string is highly nonunique.
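
A round-trip check, as opposed to comparing checksums of the encoded bytes, could look roughly like the following C sketch. The helper names `conv` and `roundtrips` and the buffer bound of 8 output bytes per input byte are assumptions made for this illustration; they come from neither musl nor the MUA under discussion. Only `iconv_open(3)`, `iconv(3)` and `iconv_close(3)` are real API.

```c
#include <assert.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: one-shot conversion into a caller-supplied buffer.
 * Returns the number of bytes written, or (size_t)-1 on any failure. */
static size_t conv(const char *to, const char *from,
                   const char *in, size_t inlen, char *out, size_t outcap)
{
    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *inp = (char *)in;      /* iconv() wants char **; input is not modified */
    char *outp = out;
    size_t inleft = inlen, outleft = outcap;

    size_t r = iconv(cd, &inp, &inleft, &outp, &outleft);
    if (r != (size_t)-1)
        /* POSIX: a NULL input asks for the sequence that returns the
         * output to its initial shift state (may write nothing). */
        r = iconv(cd, NULL, NULL, &outp, &outleft);

    iconv_close(cd);
    return r == (size_t)-1 ? (size_t)-1 : (size_t)(outp - out);
}

/* Returns 1 if UTF-8 -> enc -> UTF-8 reproduces the input exactly,
 * 0 if it does not, and -1 if either conversion fails. */
int roundtrips(const char *enc, const char *utf8, size_t len)
{
    assert(enc != NULL && utf8 != NULL);

    size_t cap = len * 8 + 16;   /* assumed generous bound, not a documented worst case */
    char *enc_buf = malloc(cap);
    char *back = malloc(cap);
    int ok = -1;

    if (enc_buf != NULL && back != NULL) {
        size_t m = conv(enc, "UTF-8", utf8, len, enc_buf, cap);
        size_t b = (m == (size_t)-1) ? m : conv("UTF-8", enc, enc_buf, m, back, cap);
        if (b != (size_t)-1)
            ok = (b == len && memcmp(back, utf8, len) == 0);
    }
    free(enc_buf);
    free(back);
    return ok;
}
```

A test built this way compares the decoded text, so it does not depend on where a particular converter chooses to place its escape sequences.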

In particular musl's to-ISO-2022-JP converter is stateless and always
generates shifts in/out around every non-ASCII character. Of course
this is highly suboptimal, but in the worst case (where the caller
calls iconv one character at a time) the iconv API can't do any better
because strings are required to end in the unshifted state, and the
iconv API doesn't have any method to "finalize" a conversion. This
implies that every time iconv returns with non-ASCII as the most
recent output character, it must be followed by a shift back to the
initial (ASCII) state.

We could improve this in the case of batch conversions by overwriting
the previous shift-back-to-initial and skipping the next shift if the
character set of the next character to output matches the previous
one, but that only works within a single batch call, since iconv can't
write outside the buffer passed to it for the current call. This is an
improvement I think I want to make, since it would improve typical
output size a lot, but the cost is output determinism under different
chunking by the caller.
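
The size difference and the dependence on chunking can be made concrete with a small length model in C. It is only a sketch under stated assumptions: a 3-byte escape to enter a two-byte character set (such as ESC $ B), a 3-byte escape back to ASCII (ESC ( B), one byte per ASCII character and two bytes per non-ASCII character. The `enum cs` classes and `len_chunked` are hypothetical names, and the model describes the batch optimisation as proposed here, not musl's actual code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical character classes, for illustration only. */
enum cs { CS_ASCII, CS_JISX0208 };

/* Output length when the caller feeds the input in calls of `chunk`
 * characters and shifts are merged only within a single call. */
size_t len_chunked(const enum cs *s, size_t n, size_t chunk)
{
    assert(chunk > 0);

    size_t len = 0;
    for (size_t start = 0; start < n; start += chunk) {
        size_t end = (n - start < chunk) ? n : start + chunk;
        enum cs cur = CS_ASCII;           /* every call starts in the initial state */

        for (size_t i = start; i < end; i++) {
            if (s[i] != cur) {            /* character set changes: emit an escape */
                len += 3;
                cur = s[i];
            }
            len += (s[i] == CS_ASCII) ? 1 : 2;
        }
        if (cur != CS_ASCII)              /* every call must end in the initial state */
            len += 3;
    }
    return len;
}
```

With `chunk` equal to 1 the model yields 8 bytes per non-ASCII character, which corresponds to the stateless behaviour described above. With `chunk` equal to `n` a run of non-ASCII characters pays for its two escapes only once. The same input therefore produces different byte sequences depending on how the caller splits it.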

Rich



* Re: iconv failure (ISO-2022-JP) since musl update on AlpineLinux
  2018-02-27 16:57 iconv failure (ISO-2022-JP) since musl update on AlpineLinux Steffen Nurpmeso
  2018-02-27 17:34 ` Rich Felker
@ 2018-02-27 19:44 ` Steffen Nurpmeso
  2018-02-27 20:19   ` Rich Felker
  1 sibling, 1 reply; 5+ messages in thread
From: Steffen Nurpmeso @ 2018-02-27 19:44 UTC
  To: musl

Hi.

Rich Felker wrote:

sorry i did not get this :)

but i wrote:
 ||After updating to musl-1.1.19-r0 there i saw test failures for the
 ||MUA i maintain, namely regarding the mentioned charset.  I will
 ||attach a file to reproduce.  (Am not subscribed.)
 ...
 ||  #?0[steffen@devon steffen]$ cksum in.utf 
 ||  1259742080 686 in.utf
 ||  #?0[steffen@devon steffen]$ iconv -f utf8 -t iso-2022-jp <in.utf|cksum
 ||  2184132317 536
 ||  #?0[steffen@devon steffen]$ iconv --version
 ||  iconv (GNU libiconv 1.11)
 ||..
 ||  #?0[steffen@essex tmp]$ cksum in.utf 
 ||  1259742080 686 in.utf
 ||  #?0[steffen@essex tmp]$ iconv -f utf8 -t iso-2022-jp <in.utf|cksum 
 ||  209789743 1736
 ||  #?0[steffen@essex tmp]$ apk info --who-owns /usr/bin/iconv 
 ||  /usr/bin/iconv is owned by musl-utils-1.1.19-r0

 |Does the data round-trip correctly? I don't think you can expect

Ok, i see what you mean, yes, musl iconv(1) can roundtrip.  But..
But for one the error is new (though i actually have forgotten
whether the test ever ran on a musl box or only on BSD and glibc
Linux boxes, but if i recall, it did run, and then it did succeed,
definitely), and then...

 |bitwise match between outputs of different ISO-2022-JP converters,
 |unless perhaps they both guarantee minimality, because the ISO-2022-JP
 |representation of a string is highly nonunique.
 |
 |In particular musl's to-ISO-2022-JP converter is stateless and always
 |generates shifts in/out around every non-ASCII character. Of course
 |this is highly suboptimal, but in the worst case (where the caller
 |calls iconv one character at a time) the iconv API can't do any better
 |because strings are required to end in the unshifted state, and the
 |iconv API doesn't have any method to "finalize" a conversion. This
 |implies that every time iconv returns with non-ASCII as the most
 |recent output character, it must be followed by a shift back to the
 |initial (ASCII) state.
 |
 |We could improve this in the case of batch conversions by overwriting
 |the previous shift-back-to-initial and skipping the next shift if the
 |character set of the next character to output matches the previous
 |one, but that only works within a single batch call, since iconv can't
 |write outside the buffer passed to it for the current call. This is an
 |improvement I think I want to make, since it would improve typical
 |output size a lot, but the cost is output determinism under different
 |chunking by the caller.

Well...  In my cases the MUA fails to convert to ISO-2022-JP at
all, because an iconv(3) error happens.  And when i instrument my
code like

     for(;;){
        size_t sz;

  fprintf(stderr, "iconv(3): in %lu out: %lu\n",*inbleft,*outbleft);
  fprintf(stderr, "     in<%.*s>\n",(int)*inbleft,*inb);
        sz = iconv(cd, __INBCAST(inb), inbleft, outb, outbleft);
        if(sz > 0 && !(icf & n_ICONV_IGN_NOREVERSE)){
  fprintf(stderr, "iconv(3) returned 0x%lX: %s\n",(ul_i)sz,strerror(errno));
           err = n_ERR_NOENT;
           goto jleave;
        }
        if(sz != (size_t)-1)
           break;

then i get

  #?1[steffen@essex nail.git]$ v mae-test-behave_iconv_mbyte_base64-2
  ICONV 2
  iconv(3): in 220 out: 427
       in<シジュウカラ科(シジュウカラか、学名 Paridae)は、鳥類スズメ目の科である。シジュウカラ(四十雀)と総称されるが、狭義にはこの1種をシジュウカラと呼ぶ。
  >
  iconv(3) returned 0xFFFFFFFFFFFFFFFF: Argument list too long
  ICONV 2 err: 2

And that is somehow ooops?  Interestingly if i call iconv(1) only
on these 220 bytes i can roundtrip that, too.  Hmmm.  ...
I thought maybe it is because of the tcc(1) compiler i use, but
i can reproduce this with AlpineLinux gcc(1), too.  I don't know.
Ciao,

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)



* Re: Re: iconv failure (ISO-2022-JP) since musl update on AlpineLinux
  2018-02-27 19:44 ` Steffen Nurpmeso
@ 2018-02-27 20:19   ` Rich Felker
  2018-02-27 21:28     ` Steffen Nurpmeso
  0 siblings, 1 reply; 5+ messages in thread
From: Rich Felker @ 2018-02-27 20:19 UTC
  To: musl, Steffen Nurpmeso

On Tue, Feb 27, 2018 at 08:44:32PM +0100, Steffen Nurpmeso wrote:
> Hi.
> 
> Rich Felker wrote:
> 
> sorry i did not get this :)

Sorry I neglected to keep you CC'd.

> but i wrote:
>  ||After updating to musl-1.1.19-r0 there i saw test failures for the
>  ||MUA i maintain, namely regarding the mentioned charset.  I will
>  ||attach a file to reproduce.  (Am not subscribed.)
>  ...
>  ||  #?0[steffen@devon steffen]$ cksum in.utf 
>  ||  1259742080 686 in.utf
>  ||  #?0[steffen@devon steffen]$ iconv -f utf8 -t iso-2022-jp <in.utf|cksum
>  ||  2184132317 536
>  ||  #?0[steffen@devon steffen]$ iconv --version
>  ||  iconv (GNU libiconv 1.11)
>  ||..
>  ||  #?0[steffen@essex tmp]$ cksum in.utf 
>  ||  1259742080 686 in.utf
>  ||  #?0[steffen@essex tmp]$ iconv -f utf8 -t iso-2022-jp <in.utf|cksum 
>  ||  209789743 1736
>  ||  #?0[steffen@essex tmp]$ apk info --who-owns /usr/bin/iconv 
>  ||  /usr/bin/iconv is owned by musl-utils-1.1.19-r0
> 
>  |Does the data round-trip correctly? I don't think you can expect
> 
> Ok, i see what you mean, yes, musl iconv(1) can roundtrip.  But..
> But for one the error is new (though i actually have forgotten
> whether the test ever ran on a musl box or only on BSD and glibc
> Linux boxes, but if i recall, it did run, and then it did succeed,
> definitely), and then...
> 
>  |bitwise match between outputs of different ISO-2022-JP converters,
>  |unless perhaps they both guarantee minimality, because the ISO-2022-JP
>  |representation of a string is highly nonunique.
>  |
>  |In particular musl's to-ISO-2022-JP converter is stateless and always
>  |generates shifts in/out around every non-ASCII character. Of course
>  |this is highly suboptimal, but in the worst case (where the caller
>  |calls iconv one character at a time) the iconv API can't do any better
>  |because strings are required to end in the unshifted state, and the
>  |iconv API doesn't have any method to "finalize" a conversion. This
>  |implies that every time iconv returns with non-ASCII as the most
>  |recent output character, it must be followed by a shift back to the
>  |initial (ASCII) state.
>  |
>  |We could improve this in the case of batch conversions by overwriting
>  |the previous shift-back-to-initial and skipping the next shift if the
>  |character set of the next character to output matches the previous
>  |one, but that only works within a single batch call, since iconv can't
>  |write outside the buffer passed to it for the current call. This is an
>  |improvement I think I want to make, since it would improve typical
>  |output size a lot, but the cost is output determinism under different
>  |chunking by the caller.
> 
> Well...  In my cases the MUA fails to convert to ISO-2022-JP at
> all, because an iconv(3) error happens.  And when i instrument my
> code like
> 
>      for(;;){
>         size_t sz;
> 
>   fprintf(stderr, "iconv(3): in %lu out: %lu\n",*inbleft,*outbleft);
>   fprintf(stderr, "     in<%.*s>\n",(int)*inbleft,*inb);
>         sz = iconv(cd, __INBCAST(inb), inbleft, outb, outbleft);
>         if(sz > 0 && !(icf & n_ICONV_IGN_NOREVERSE)){
>   fprintf(stderr, "iconv(3) returned 0x%lX: %s\n",(ul_i)sz,strerror(errno));
>            err = n_ERR_NOENT;
>            goto jleave;
>         }
>         if(sz != (size_t)-1)
>            break;
> 
> then i get
> 
>   #?1[steffen@essex nail.git]$ v mae-test-behave_iconv_mbyte_base64-2
>   ICONV 2
>   iconv(3): in 220 out: 427
>        in<シジュウカラ科(シジュウカラか、学名 Paridae)は、鳥類スズメ目の科である。シジュウカラ(四十雀)と総称されるが、狭義にはこの1種をシジュウカラと呼ぶ。
>   >
>   iconv(3) returned 0xFFFFFFFFFFFFFFFF: Argument list too long
>   ICONV 2 err: 2
> 
> And that is somehow ooops?  Interestingly if i call iconv(1) only
> on these 220 bytes i can roundtrip that, too.  Hmmm.  ...
> I thought maybe it is because of the tcc(1) compiler i use, but
> i can reproduce this with AlpineLinux gcc(1), too.  I don't know.

I think the test is just using an output buffer that's under the
worst-case size needed for conversion to ISO-2022-JP. The E2BIG error
is specified for "Input conversion stopped due to lack of space in the
output buffer" and is not really an error; it just means the
conversion stopped before reaching the end and you need to resume with
a new buffer for the remainder of the conversion.
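
A conversion loop that treats E2BIG as "enlarge the buffer and resume" might be sketched in C as follows. The function name `convert_all`, the doubling strategy and the small initial capacity are assumptions chosen for illustration and are not taken from the MUA's wrapper. The `iconv(3)` calls and the `errno` values are those specified by POSIX.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: converts in[0..inlen) and returns a malloc'd buffer
 * with its length in *outlen, or NULL on a real error. */
char *convert_all(const char *to, const char *from,
                  const char *in, size_t inlen, size_t *outlen)
{
    assert(in != NULL && outlen != NULL);

    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return NULL;

    size_t cap = 16;              /* deliberately small so E2BIG is exercised */
    char *out = malloc(cap);
    char *inp = (char *)in;       /* iconv() wants char **; input is not modified */
    size_t inleft = inlen, used = 0;
    int flushing = 0;             /* second phase: emit any final shift sequence */

    while (out != NULL) {
        char *outp = out + used;
        size_t outleft = cap - used;
        size_t r = flushing ? iconv(cd, NULL, NULL, &outp, &outleft)
                            : iconv(cd, &inp, &inleft, &outp, &outleft);
        used = (size_t)(outp - out);   /* keep whatever was written so far */

        if (r != (size_t)-1) {
            if (flushing)
                break;                 /* input consumed and state reset: done */
            flushing = 1;
            continue;
        }
        if (errno != E2BIG) {          /* EILSEQ or EINVAL: a real failure */
            free(out);
            out = NULL;
            break;
        }
        /* E2BIG: output buffer full, so enlarge it and resume where we stopped. */
        cap *= 2;
        char *tmp = realloc(out, cap);
        if (tmp == NULL)
            free(out);
        out = tmp;
    }

    iconv_close(cd);
    if (out != NULL)
        *outlen = used;
    return out;
}
```

The point of the sketch is that the `(size_t)-1` return value has to be examined together with `errno`. Only EILSEQ and EINVAL end the conversion, while E2BIG merely asks for more output space.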

Rich



* Re: Re: iconv failure (ISO-2022-JP) since musl update on AlpineLinux
  2018-02-27 20:19   ` Rich Felker
@ 2018-02-27 21:28     ` Steffen Nurpmeso
  0 siblings, 0 replies; 5+ messages in thread
From: Steffen Nurpmeso @ 2018-02-27 21:28 UTC
  To: musl

Hello Rich Felker.

Rich Felker <dalias@libc.org> wrote:
 |On Tue, Feb 27, 2018 at 08:44:32PM +0100, Steffen Nurpmeso wrote:
 |> Rich Felker wrote:
 ...
 |> but i wrote:
 |>||After updating to musl-1.1.19-r0 there i saw test failures for the
 |>||MUA i maintain, namely regarding the mentioned charset.  I will
 ..
 |>|Does the data round-trip correctly? I don't think you can expect
 |> 
 |> Ok, i see what you mean, yes, musl iconv(1) can roundtrip.  But..
 ...
 |> Well...  In my cases the MUA fails to convert to ISO-2022-JP at
 |> all, because an iconv(3) error happens.  And when i instrument my
 ..
 |I think the test is just using an output buffer that's under the
 |worst-case size needed for conversion to ISO-2022-JP. The E2BIG error
 |is specified for "Input conversion stopped due to lack of space in the
 |output buffer" and is not really an error; it just means the
 |conversion stopped before reaching the end and you need to resume with
 |a new buffer for the remainder of the conversion.

That iconv(3) wrapper i had hacked into that MUA in 2014 was
indeed complete nonsense and entirely false.  Now corrected.
Thanks for answering the brain damage.
And i will adjust the tests to checksum only the headers and the
roundtrip output of the body content, thanks for pointing this
out.  Be aware you have been credited.
Ciao,

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)




Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/
