mailing list of musl libc
 help / color / mirror / code / Atom feed
From: Rich Felker <dalias@libc.org>
To: musl@lists.openwall.com
Subject: Re: Issues with iconv conversions (UTF-8 -> cp*)
Date: Wed, 3 May 2017 16:57:27 -0400	[thread overview]
Message-ID: <20170503205727.GC17319@brightrain.aerifal.cx> (raw)
In-Reply-To: <20170503175319.GB17319@brightrain.aerifal.cx>

[-- Attachment #1: Type: text/plain, Size: 1918 bytes --]

On Wed, May 03, 2017 at 01:53:19PM -0400, Rich Felker wrote:
> On Wed, May 03, 2017 at 05:20:47PM +0000, maksis . wrote:
> > Hi,
> > 
> > I’m experiencing issues with iconv conversions from UTF-8 to Windows codepages (cp* -> UTF-8 seems to be working fine).
> > 
> > 
> > Test program: https://gist.github.com/maksis/ef6562b43c94a6a29dc21b987ec4c0cf
> > 
> > gcc output:  UTF-8 -> cp1250 -> UTF-8: ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz{|}~€‚„…†‡‰Š‹ŚŤŽŹ‘’“”•–—™š›śťžźˇ˘Ł¤Ą¦§¨©Ş«¬®Ż°±˛ł´µ¶·¸ąş»Ľ˝ľżŔÁÂĂÄĹĆÇČÉĘËĚÍÎĎĐŃŇÓÔŐÖ×ŘŮÚŰÜÝŢßŕáâăäĺćçčéęëěíîďđńňóôőö÷řůúűüýţ
> > 
> > musl-gcc output:  UTF-8 -> cp1250 -> UTF-8: ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz{|}~€‚„…†‡‰Š‹ŚŤŽŹ‘’“”•–—™š›śťžźˇ˘Ł*Ą****Ş***Ż**˛ł*****ąş*Ľ˝ľżŔ**Ă*ĹĆ*Č*Ę*Ě**ĎĐŃŇ**Ő**ŘŮ*Ű**Ţ*ŕ**ă*ĺć*č*ę*ě**ďđńň**ő**řů*ű**ţ
> > 
> > Tested with GCC 6.3.1 and musl 1.1.16
> 
> Thanks. I've confirmed that this is a bug in conversion to (not from,
> only to) legacy 8bit codepages; there's missing logic for the case
> (which is handled specially in the tables) where a unicode codepoint
> is represented by its own value (that fits in 8 bits). I'll work on a
> patch to fix it and follow up when it's either committed or ready for
> testing.

Does the attached patch work for you? Anyone see problems with it? The
functional change is at:

-			if (c < 128+totype) {
+			if (c < 128+totype || c==legacy_map(tomap, c)) {

The rest is just refactoring to put the shared logic in a new static
function legacy_map rather than repeating it two (and now three, after
the fix) times. Further simplification via refactoring is actually
possible but I tried to keep this patch short and easy to review.

Rich

[-- Attachment #2: iconv_tolegacy_fix.diff --]
[-- Type: text/plain, Size: 1546 bytes --]

diff --git a/src/locale/iconv.c b/src/locale/iconv.c
index 1eeea94..5eb1843 100644
--- a/src/locale/iconv.c
+++ b/src/locale/iconv.c
@@ -151,6 +151,14 @@ static void put_32(unsigned char *s, unsigned c, int e)
 #define mbrtowc_utf8 mbrtowc
 #define wctomb_utf8 wctomb
 
+static unsigned legacy_map(const unsigned char *map, unsigned c)
+{
+	unsigned x = c - 128 + map[-1];
+	x = legacy_chars[ map[x*5/4]>>2*x%8 |
+		map[x*5/4+1]<<8-2*x%8 & 1023 ];
+	return x ? x : c;
+}
+
 size_t iconv(iconv_t cd0, char **restrict in, size_t *restrict inb, char **restrict out, size_t *restrict outb)
 {
 	size_t x=0;
@@ -364,10 +372,7 @@ size_t iconv(iconv_t cd0, char **restrict in, size_t *restrict inb, char **restr
 			break;
 		default:
 			if (c < 128+type) break;
-			c -= 128+type;
-			c = legacy_chars[ map[c*5/4]>>2*c%8 |
-				map[c*5/4+1]<<8-2*c%8 & 1023 ];
-			if (!c) c = *(unsigned char *)*in;
+			c = legacy_map(map, c);
 			if (c==1) goto ilseq;
 		}
 
@@ -392,17 +397,15 @@ size_t iconv(iconv_t cd0, char **restrict in, size_t *restrict inb, char **restr
 			if (c > 0x7f) subst: x++, c='*';
 		default:
 			if (*outb < 1) goto toobig;
-			if (c < 128+totype) {
+			if (c < 128+totype || c==legacy_map(tomap, c)) {
 			revout:
 				*(*out)++ = c;
 				*outb -= 1;
 				break;
 			}
 			d = c;
-			for (c=0; c<128-totype; c++) {
-				if (d == legacy_chars[ tomap[c*5/4]>>2*c%8 |
-					tomap[c*5/4+1]<<8-2*c%8 & 1023 ]) {
-					c += 128;
+			for (c=128+totype; c<256; c++) {
+				if (d == legacy_map(tomap, c)) {
 					goto revout;
 				}
 			}

  reply	other threads:[~2017-05-03 20:57 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-05-03 17:20 maksis .
2017-05-03 17:53 ` Rich Felker
2017-05-03 20:57   ` Rich Felker [this message]
2017-05-04  9:23     ` maksis .
2017-05-04 14:07       ` Rich Felker
2017-05-04 14:49         ` maksis .

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20170503205727.GC17319@brightrain.aerifal.cx \
    --to=dalias@libc.org \
    --cc=musl@lists.openwall.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).