* Issues with iconv conversions (UTF-8 -> cp*) @ 2017-05-03 17:20 maksis . 2017-05-03 17:53 ` Rich Felker 0 siblings, 1 reply; 6+ messages in thread From: maksis . @ 2017-05-03 17:20 UTC (permalink / raw) To: musl@lists.openwall.com <musl@lists.openwall.com> [-- Attachment #1: Type: text/plain, Size: 705 bytes --] Hi, I’m experiencing issues with iconv conversions from UTF-8 to Windows codepages (cp* -> UTF-8 seems to be working fine). Test program: https://gist.github.com/maksis/ef6562b43c94a6a29dc21b987ec4c0cf gcc output: UTF-8 -> cp1250 -> UTF-8: ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz{|}~€‚„…†‡‰Š‹ŚŤŽŹ‘’“”•–—™š›śťžźˇ˘Ł¤Ą¦§¨©Ş«¬®Ż°±˛ł´µ¶·¸ąş»Ľ˝ľżŔÁÂĂÄĹĆÇČÉĘËĚÍÎĎĐŃŇÓÔŐÖ×ŘŮÚŰÜÝŢßŕáâăäĺćçčéęëěíîďđńňóôőö÷řůúűüýţ musl-gcc output: UTF-8 -> cp1250 -> UTF-8: ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz{|}~€‚„…†‡‰Š‹ŚŤŽŹ‘’“”•–—™š›śťžźˇ˘Ł*Ą****Ş***Ż**˛ł*****ąş*Ľ˝ľżŔ**Ă*ĹĆ*Č*Ę*Ě**ĎĐŃŇ**Ő**ŘŮ*Ű**Ţ*ŕ**ă*ĺć*č*ę*ě**ďđńň**ő**řů*ű**ţ Tested with GCC 6.3.1 and musl 1.1.16 [-- Attachment #2: Type: text/html, Size: 2517 bytes --] ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: Issues with iconv conversions (UTF-8 -> cp*) 2017-05-03 17:20 Issues with iconv conversions (UTF-8 -> cp*) maksis . @ 2017-05-03 17:53 ` Rich Felker 2017-05-03 20:57 ` Rich Felker 0 siblings, 1 reply; 6+ messages in thread From: Rich Felker @ 2017-05-03 17:53 UTC (permalink / raw) To: musl On Wed, May 03, 2017 at 05:20:47PM +0000, maksis . wrote: > Hi, > > I’m experiencing issues with iconv conversions from UTF-8 to Windows codepages (cp* -> UTF-8 seems to be working fine). > > > Test program: https://gist.github.com/maksis/ef6562b43c94a6a29dc21b987ec4c0cf > > gcc output: UTF-8 -> cp1250 -> UTF-8: ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz{|}~€‚„…†‡‰Š‹ŚŤŽŹ‘’“”•–—™š›śťžźˇ˘Ł¤Ą¦§¨©Ş«¬®Ż°±˛ł´µ¶·¸ąş»Ľ˝ľżŔÁÂĂÄĹĆÇČÉĘËĚÍÎĎĐŃŇÓÔŐÖ×ŘŮÚŰÜÝŢßŕáâăäĺćçčéęëěíîďđńňóôőö÷řůúűüýţ > > musl-gcc output: UTF-8 -> cp1250 -> UTF-8: ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz{|}~€‚„…†‡‰Š‹ŚŤŽŹ‘’“”•–—™š›śťžźˇ˘Ł*Ą****Ş***Ż**˛ł*****ąş*Ľ˝ľżŔ**Ă*ĹĆ*Č*Ę*Ě**ĎĐŃŇ**Ő**ŘŮ*Ű**Ţ*ŕ**ă*ĺć*č*ę*ě**ďđńň**ő**řů*ű**ţ > > Tested with GCC 6.3.1 and musl 1.1.16 Thanks. I've confirmed that this is a bug in conversion to (not from, only to) legacy 8bit codepages; there's missing logic for the case (which is handled specially in the tables) where a unicode codepoint is represented by its own value (that fits in 8 bits). I'll work on a patch to fix it and follow up when it's either committed or ready for testing. Rich ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: Issues with iconv conversions (UTF-8 -> cp*) 2017-05-03 17:53 ` Rich Felker @ 2017-05-03 20:57 ` Rich Felker 2017-05-04 9:23 ` maksis . 0 siblings, 1 reply; 6+ messages in thread From: Rich Felker @ 2017-05-03 20:57 UTC (permalink / raw) To: musl [-- Attachment #1: Type: text/plain, Size: 1918 bytes --] On Wed, May 03, 2017 at 01:53:19PM -0400, Rich Felker wrote: > On Wed, May 03, 2017 at 05:20:47PM +0000, maksis . wrote: > > Hi, > > > > I’m experiencing issues with iconv conversions from UTF-8 to Windows codepages (cp* -> UTF-8 seems to be working fine). > > > > > > Test program: https://gist.github.com/maksis/ef6562b43c94a6a29dc21b987ec4c0cf > > > > gcc output: UTF-8 -> cp1250 -> UTF-8: ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz{|}~€‚„…†‡‰Š‹ŚŤŽŹ‘’“”•–—™š›śťžźˇ˘Ł¤Ą¦§¨©Ş«¬®Ż°±˛ł´µ¶·¸ąş»Ľ˝ľżŔÁÂĂÄĹĆÇČÉĘËĚÍÎĎĐŃŇÓÔŐÖ×ŘŮÚŰÜÝŢßŕáâăäĺćçčéęëěíîďđńňóôőö÷řůúűüýţ > > > > musl-gcc output: UTF-8 -> cp1250 -> UTF-8: ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz{|}~€‚„…†‡‰Š‹ŚŤŽŹ‘’“”•–—™š›śťžźˇ˘Ł*Ą****Ş***Ż**˛ł*****ąş*Ľ˝ľżŔ**Ă*ĹĆ*Č*Ę*Ě**ĎĐŃŇ**Ő**ŘŮ*Ű**Ţ*ŕ**ă*ĺć*č*ę*ě**ďđńň**ő**řů*ű**ţ > > > > Tested with GCC 6.3.1 and musl 1.1.16 > > Thanks. I've confirmed that this is a bug in conversion to (not from, > only to) legacy 8bit codepages; there's missing logic for the case > (which is handled specially in the tables) where a unicode codepoint > is represented by its own value (that fits in 8 bits). I'll work on a > patch to fix it and follow up when it's either committed or ready for > testing. Does the attached patch work for you? Anyone see problems with it? The functional change is at: - if (c < 128+totype) { + if (c < 128+totype || c==legacy_map(tomap, c)) { The rest is just refactoring to put the shared logic in a new static function legacy_map rather than repeating it two (and now three, after the fix) times. Further simplification via refactoring is actually possible but I tried to keep this patch short and easy to review. Rich [-- Attachment #2: iconv_tolegacy_fix.diff --] [-- Type: text/plain, Size: 1546 bytes --] diff --git a/src/locale/iconv.c b/src/locale/iconv.c index 1eeea94..5eb1843 100644 --- a/src/locale/iconv.c +++ b/src/locale/iconv.c @@ -151,6 +151,14 @@ static void put_32(unsigned char *s, unsigned c, int e) #define mbrtowc_utf8 mbrtowc #define wctomb_utf8 wctomb +static unsigned legacy_map(const unsigned char *map, unsigned c) +{ + unsigned x = c - 128 + map[-1]; + x = legacy_chars[ map[x*5/4]>>2*x%8 | + map[x*5/4+1]<<8-2*x%8 & 1023 ]; + return x ? x : c; +} + size_t iconv(iconv_t cd0, char **restrict in, size_t *restrict inb, char **restrict out, size_t *restrict outb) { size_t x=0; @@ -364,10 +372,7 @@ size_t iconv(iconv_t cd0, char **restrict in, size_t *restrict inb, char **restr break; default: if (c < 128+type) break; - c -= 128+type; - c = legacy_chars[ map[c*5/4]>>2*c%8 | - map[c*5/4+1]<<8-2*c%8 & 1023 ]; - if (!c) c = *(unsigned char *)*in; + c = legacy_map(map, c); if (c==1) goto ilseq; } @@ -392,17 +397,15 @@ size_t iconv(iconv_t cd0, char **restrict in, size_t *restrict inb, char **restr if (c > 0x7f) subst: x++, c='*'; default: if (*outb < 1) goto toobig; - if (c < 128+totype) { + if (c < 128+totype || c==legacy_map(tomap, c)) { revout: *(*out)++ = c; *outb -= 1; break; } d = c; - for (c=0; c<128-totype; c++) { - if (d == legacy_chars[ tomap[c*5/4]>>2*c%8 | - tomap[c*5/4+1]<<8-2*c%8 & 1023 ]) { - c += 128; + for (c=128+totype; c<256; c++) { + if (d == legacy_map(tomap, c)) { goto revout; } } ^ permalink raw reply [flat|nested] 6+ messages in thread
* RE: Issues with iconv conversions (UTF-8 -> cp*) 2017-05-03 20:57 ` Rich Felker @ 2017-05-04 9:23 ` maksis . 2017-05-04 14:07 ` Rich Felker 0 siblings, 1 reply; 6+ messages in thread From: maksis . @ 2017-05-04 9:23 UTC (permalink / raw) To: musl [-- Attachment #1: Type: text/plain, Size: 2155 bytes --] Seems like it’s better, but certain characters are converted incorrectly by the linked test app (such as “€” -> “¬”). I also have different characters failing in my actual C++ app. From: Rich Felker<mailto:dalias@libc.org> Sent: torstai 4. toukokuuta 2017 0.00 To: musl@lists.openwall.com<mailto:musl@lists.openwall.com> Subject: Re: [musl] Issues with iconv conversions (UTF-8 -> cp*) On Wed, May 03, 2017 at 01:53:19PM -0400, Rich Felker wrote: > On Wed, May 03, 2017 at 05:20:47PM +0000, maksis . wrote: > > Hi, > > > > I’m experiencing issues with iconv conversions from UTF-8 to Windows codepages (cp* -> UTF-8 seems to be working fine). > > > > > > Test program: https://gist.github.com/maksis/ef6562b43c94a6a29dc21b987ec4c0cf > > > > gcc output: UTF-8 -> cp1250 -> UTF-8: ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz{|}~€‚„…†‡‰Š‹ŚŤŽŹ‘’“”•–—™š›śťžźˇ˘Ł¤Ą¦§¨©Ş«¬®Ż°±˛ł´µ¶·¸ąş»Ľ˝ľżŔÁÂĂÄĹĆÇČÉĘËĚÍÎĎĐŃŇÓÔŐÖ×ŘŮÚŰÜÝŢßŕáâăäĺćçčéęëěíîďđńňóôőö÷řůúűüýţ > > > > musl-gcc output: UTF-8 -> cp1250 -> UTF-8: ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz{|}~€‚„…†‡‰Š‹ŚŤŽŹ‘’“”•–—™š›śťžźˇ˘Ł*Ą****Ş***Ż**˛ł*****ąş*Ľ˝ľżŔ**Ă*ĹĆ*Č*Ę*Ě**ĎĐŃŇ**Ő**ŘŮ*Ű**Ţ*ŕ**ă*ĺć*č*ę*ě**ďđńň**ő**řů*ű**ţ > > > > Tested with GCC 6.3.1 and musl 1.1.16 > > Thanks. I've confirmed that this is a bug in conversion to (not from, > only to) legacy 8bit codepages; there's missing logic for the case > (which is handled specially in the tables) where a unicode codepoint > is represented by its own value (that fits in 8 bits). I'll work on a > patch to fix it and follow up when it's either committed or ready for > testing. Does the attached patch work for you? Anyone see problems with it? The functional change is at: - if (c < 128+totype) { + if (c < 128+totype || c==legacy_map(tomap, c)) { The rest is just refactoring to put the shared logic in a new static function legacy_map rather than repeating it two (and now three, after the fix) times. Further simplification via refactoring is actually possible but I tried to keep this patch short and easy to review. Rich [-- Attachment #2: Type: text/html, Size: 4666 bytes --] ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: Issues with iconv conversions (UTF-8 -> cp*) 2017-05-04 9:23 ` maksis . @ 2017-05-04 14:07 ` Rich Felker 2017-05-04 14:49 ` maksis . 0 siblings, 1 reply; 6+ messages in thread From: Rich Felker @ 2017-05-04 14:07 UTC (permalink / raw) To: musl [-- Attachment #1: Type: text/plain, Size: 2796 bytes --] On Thu, May 04, 2017 at 09:23:18AM +0000, maksis . wrote: > Seems like it’s better, but certain characters are converted > incorrectly by the linked test app (such as “€” -> “¬”). I also have > different characters failing in my actual C++ app. I can't reproduce this. With the patch applied I have both “€” and “¬” successfully round-tripping from UTF-8 to CP1250 and back... ..however that makes sense because the bug was out-of-bounds memory reads, whose behavior would vary by build/system/etc. (UB). Try this v2 patch. Rich > From: Rich Felker<mailto:dalias@libc.org> > Sent: torstai 4. toukokuuta 2017 0.00 > To: musl@lists.openwall.com<mailto:musl@lists.openwall.com> > Subject: Re: [musl] Issues with iconv conversions (UTF-8 -> cp*) > > On Wed, May 03, 2017 at 01:53:19PM -0400, Rich Felker wrote: > > On Wed, May 03, 2017 at 05:20:47PM +0000, maksis . wrote: > > > Hi, > > > > > > I’m experiencing issues with iconv conversions from UTF-8 to Windows codepages (cp* -> UTF-8 seems to be working fine). > > > > > > > > > Test program: https://gist.github.com/maksis/ef6562b43c94a6a29dc21b987ec4c0cf > > > > > > gcc output: UTF-8 -> cp1250 -> UTF-8: ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz{|}~€‚„…†‡‰Š‹ŚŤŽŹ‘’“”•–—™š›śťžźˇ˘Ł¤Ą¦§¨©Ş«¬®Ż°±˛ł´µ¶·¸ąş»Ľ˝ľżŔÁÂĂÄĹĆÇČÉĘËĚÍÎĎĐŃŇÓÔŐÖ×ŘŮÚŰÜÝŢßŕáâăäĺćçčéęëěíîďđńňóôőö÷řůúűüýţ > > > > > > musl-gcc output: UTF-8 -> cp1250 -> UTF-8: ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz{|}~€‚„…†‡‰Š‹ŚŤŽŹ‘’“”•–—™š›śťžźˇ˘Ł*Ą****Ş***Ż**˛ł*****ąş*Ľ˝ľżŔ**Ă*ĹĆ*Č*Ę*Ě**ĎĐŃŇ**Ő**ŘŮ*Ű**Ţ*ŕ**ă*ĺć*č*ę*ě**ďđńň**ő**řů*ű**ţ > > > > > > Tested with GCC 6.3.1 and musl 1.1.16 > > > > Thanks. I've confirmed that this is a bug in conversion to (not from, > > only to) legacy 8bit codepages; there's missing logic for the case > > (which is handled specially in the tables) where a unicode codepoint > > is represented by its own value (that fits in 8 bits). I'll work on a > > patch to fix it and follow up when it's either committed or ready for > > testing. > > Does the attached patch work for you? Anyone see problems with it? The > functional change is at: > > - if (c < 128+totype) { > + if (c < 128+totype || c==legacy_map(tomap, c)) { > > The rest is just refactoring to put the shared logic in a new static > function legacy_map rather than repeating it two (and now three, after > the fix) times. Further simplification via refactoring is actually > possible but I tried to keep this patch short and easy to review. > > Rich > [-- Attachment #2: iconv_tolegacy_fix_v2.diff --] [-- Type: text/plain, Size: 1557 bytes --] diff --git a/src/locale/iconv.c b/src/locale/iconv.c index 1eeea94..4636307 100644 --- a/src/locale/iconv.c +++ b/src/locale/iconv.c @@ -151,6 +151,14 @@ static void put_32(unsigned char *s, unsigned c, int e) #define mbrtowc_utf8 mbrtowc #define wctomb_utf8 wctomb +static unsigned legacy_map(const unsigned char *map, unsigned c) +{ + unsigned x = c - 128 + map[-1]; + x = legacy_chars[ map[x*5/4]>>2*x%8 | + map[x*5/4+1]<<8-2*x%8 & 1023 ]; + return x ? x : c; +} + size_t iconv(iconv_t cd0, char **restrict in, size_t *restrict inb, char **restrict out, size_t *restrict outb) { size_t x=0; @@ -364,10 +372,7 @@ size_t iconv(iconv_t cd0, char **restrict in, size_t *restrict inb, char **restr break; default: if (c < 128+type) break; - c -= 128+type; - c = legacy_chars[ map[c*5/4]>>2*c%8 | - map[c*5/4+1]<<8-2*c%8 & 1023 ]; - if (!c) c = *(unsigned char *)*in; + c = legacy_map(map, c); if (c==1) goto ilseq; } @@ -392,17 +397,15 @@ size_t iconv(iconv_t cd0, char **restrict in, size_t *restrict inb, char **restr if (c > 0x7f) subst: x++, c='*'; default: if (*outb < 1) goto toobig; - if (c < 128+totype) { + if (c < 128+totype || (c<256 && c==legacy_map(tomap, c))) { revout: *(*out)++ = c; *outb -= 1; break; } d = c; - for (c=0; c<128-totype; c++) { - if (d == legacy_chars[ tomap[c*5/4]>>2*c%8 | - tomap[c*5/4+1]<<8-2*c%8 & 1023 ]) { - c += 128; + for (c=128+totype; c<256; c++) { + if (d == legacy_map(tomap, c)) { goto revout; } } ^ permalink raw reply [flat|nested] 6+ messages in thread
* RE: Issues with iconv conversions (UTF-8 -> cp*) 2017-05-04 14:07 ` Rich Felker @ 2017-05-04 14:49 ` maksis . 0 siblings, 0 replies; 6+ messages in thread From: maksis . @ 2017-05-04 14:49 UTC (permalink / raw) To: musl [-- Attachment #1: Type: text/plain, Size: 2913 bytes --] Thanks! Based on quick testing with a few different charsets, the results are now correct in both apps. From: Rich Felker<mailto:dalias@libc.org> Sent: torstai 4. toukokuuta 2017 17.18 To: musl@lists.openwall.com<mailto:musl@lists.openwall.com> Subject: Re: [musl] Issues with iconv conversions (UTF-8 -> cp*) On Thu, May 04, 2017 at 09:23:18AM +0000, maksis . wrote: > Seems like it’s better, but certain characters are converted > incorrectly by the linked test app (such as “€” -> “¬”). I also have > different characters failing in my actual C++ app. I can't reproduce this. With the patch applied I have both “€” and “¬” successfully round-tripping from UTF-8 to CP1250 and back... ..however that makes sense because the bug was out-of-bounds memory reads, whose behavior would vary by build/system/etc. (UB). Try this v2 patch. Rich > From: Rich Felker<mailto:dalias@libc.org> > Sent: torstai 4. toukokuuta 2017 0.00 > To: musl@lists.openwall.com<mailto:musl@lists.openwall.com> > Subject: Re: [musl] Issues with iconv conversions (UTF-8 -> cp*) > > On Wed, May 03, 2017 at 01:53:19PM -0400, Rich Felker wrote: > > On Wed, May 03, 2017 at 05:20:47PM +0000, maksis . wrote: > > > Hi, > > > > > > I’m experiencing issues with iconv conversions from UTF-8 to Windows codepages (cp* -> UTF-8 seems to be working fine). > > > > > > > > > Test program: https://gist.github.com/maksis/ef6562b43c94a6a29dc21b987ec4c0cf > > > > > > gcc output: UTF-8 -> cp1250 -> UTF-8: ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz{|}~€‚„…†‡‰Š‹ŚŤŽŹ‘’“”•–—™š›śťžźˇ˘Ł¤Ą¦§¨©Ş«¬®Ż°±˛ł´µ¶·¸ąş»Ľ˝ľżŔÁÂĂÄĹĆÇČÉĘËĚÍÎĎĐŃŇÓÔŐÖ×ŘŮÚŰÜÝŢßŕáâăäĺćçčéęëěíîďđńňóôőö÷řůúűüýţ > > > > > > musl-gcc output: UTF-8 -> cp1250 -> UTF-8: ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz{|}~€‚„…†‡‰Š‹ŚŤŽŹ‘’“”•–—™š›śťžźˇ˘Ł*Ą****Ş***Ż**˛ł*****ąş*Ľ˝ľżŔ**Ă*ĹĆ*Č*Ę*Ě**ĎĐŃŇ**Ő**ŘŮ*Ű**Ţ*ŕ**ă*ĺć*č*ę*ě**ďđńň**ő**řů*ű**ţ > > > > > > Tested with GCC 6.3.1 and musl 1.1.16 > > > > Thanks. I've confirmed that this is a bug in conversion to (not from, > > only to) legacy 8bit codepages; there's missing logic for the case > > (which is handled specially in the tables) where a unicode codepoint > > is represented by its own value (that fits in 8 bits). I'll work on a > > patch to fix it and follow up when it's either committed or ready for > > testing. > > Does the attached patch work for you? Anyone see problems with it? The > functional change is at: > > - if (c < 128+totype) { > + if (c < 128+totype || c==legacy_map(tomap, c)) { > > The rest is just refactoring to put the shared logic in a new static > function legacy_map rather than repeating it two (and now three, after > the fix) times. Further simplification via refactoring is actually > possible but I tried to keep this patch short and easy to review. > > Rich > [-- Attachment #2: Type: text/html, Size: 5481 bytes --] ^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2017-05-04 14:49 UTC | newest] Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2017-05-03 17:20 Issues with iconv conversions (UTF-8 -> cp*) maksis . 2017-05-03 17:53 ` Rich Felker 2017-05-03 20:57 ` Rich Felker 2017-05-04 9:23 ` maksis . 2017-05-04 14:07 ` Rich Felker 2017-05-04 14:49 ` maksis .
Code repositories for project(s) associated with this public inbox https://git.vuxu.org/mirror/musl/ This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).