To: 9front@9front.org
From: Maurice Quennet
Subject: UTF kernel/user space discrepancy
Message-ID: <56E1B247.8000700@quennet.eu>
Date: Thu, 10 Mar 2016 18:43:35 +0100
List-ID: <9front.9front.org>

Hi,

while reading some source code I discovered that /sys/include/libc.h
defines

	enum
	{
		UTFmax		= 4,		/* maximum bytes per rune */
		Runesync	= 0x80,		/* cannot represent part of a UTF sequence (<) */
		Runeself	= 0x80,		/* rune and UTF sequences are the same (<) */
		Runeerror	= 0xFFFD,	/* decoding error in UTF */
		Runemax		= 0x10FFFF,	/* 21 bit rune */
		Runemask	= 0x1FFFFF,	/* bits used by runes (see grep) */
	};

whereas /sys/src/9/port/lib.h defines

	enum
	{
		UTFmax		= 3,		/* maximum bytes per rune */
		Runesync	= 0x80,		/* cannot represent part of a UTF sequence */
		Runeself	= 0x80,		/* rune and UTF sequences are the same (<) */
		Runeerror	= 0xFFFD,	/* decoding error in UTF */
		Runemax		= 0xFFFF,	/* 16 bit rune */
	};

I'm not sure if this is considered a bug (the system works either
way), but it struck me as odd that the kernel and user space would
use different types of runes (just to be crystal clear: I have no
technical knowledge about UTF whatsoever, other than "has more
characters than ASCII").
This is especially odd since vanilla Plan 9 consistently uses 21 bit
runes (although they call it "24 bit rune[s]" in port/lib.h).

I tested the patch below and it seems to work (on 386 …).

- Maurice

diff -r 6b193fcbc781 sys/src/9/port/lib.h
--- a/sys/src/9/port/lib.h	Tue Mar 08 16:45:29 2016 +0100
+++ b/sys/src/9/port/lib.h	Wed Mar 09 23:28:35 2016 +0100
@@ -35,11 +35,12 @@
 
 enum
 {
-	UTFmax		= 3,		/* maximum bytes per rune */
-	Runesync	= 0x80,		/* cannot represent part of a UTF sequence */
-	Runeself	= 0x80,		/* rune and UTF sequences are the same (<) */
+	UTFmax		= 4,		/* maximum bytes per rune */
+	Runesync	= 0x80,		/* cannot represent part of a UTF sequence (<) */
+	Runeself	= 0x80,		/* rune and UTF sequences are the same (<) */
 	Runeerror	= 0xFFFD,	/* decoding error in UTF */
-	Runemax		= 0xFFFF,	/* 16 bit rune */
+	Runemax		= 0x10FFFF,	/* 21 bit rune */
+	Runemask	= 0x1FFFFF,	/* bits used by runes (see grep) */
 };
 
 /*
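
For reference, UTFmax and Runemax are tied together by how UTF-8 works:
code points up to 0xFFFF fit in at most 3 bytes, while anything from
0x10000 up to 0x10FFFF needs 4. The following is only a rough sketch in
portable C, not Plan 9 code. The name utf8len is made up for this
illustration (it is not the libc runelen), and surrogate code points
are not treated specially.

```c
/*
 * Hypothetical helper, for illustration only (not Plan 9's runelen).
 * Returns the number of UTF-8 bytes needed to encode code point c,
 * or 0 if c lies above 0x10FFFF.
 */
int
utf8len(unsigned long c)
{
	if(c < 0x80)		/* 7 bits: one byte, same as ASCII */
		return 1;
	if(c < 0x800)		/* up to 11 bits: two bytes */
		return 2;
	if(c < 0x10000)		/* up to 16 bits: three bytes */
		return 3;
	if(c <= 0x10FFFF)	/* up to 21 bits: four bytes */
		return 4;
	return 0;		/* beyond the 21 bit rune range */
}
```

With these ranges, utf8len(0xFFFF) is 3 and utf8len(0x10FFFF) is 4,
which matches the UTFmax/Runemax pairs in the two headers: a 16 bit
Runemax goes with UTFmax = 3, a 21 bit Runemax with UTFmax = 4.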