From: Maurice Quennet <maurice@quennet.eu>
To: 9front@9front.org
Subject: UTF kernel/user space discrepancy
Date: Thu, 10 Mar 2016 18:43:35 +0100
Message-ID: <56E1B247.8000700@quennet.eu>
Hi,
while reading some source code I discovered that /sys/include/libc.h defines
enum
{
	UTFmax		= 4,		/* maximum bytes per rune */
	Runesync	= 0x80,		/* cannot represent part of a UTF sequence (<) */
	Runeself	= 0x80,		/* rune and UTF sequences are the same (<) */
	Runeerror	= 0xFFFD,	/* decoding error in UTF */
	Runemax		= 0x10FFFF,	/* 21 bit rune */
	Runemask	= 0x1FFFFF,	/* bits used by runes (see grep) */
};
whereas /sys/src/9/port/lib.h defines
enum
{
	UTFmax		= 3,		/* maximum bytes per rune */
	Runesync	= 0x80,		/* cannot represent part of a UTF sequence */
	Runeself	= 0x80,		/* rune and UTF sequences are the same (<) */
	Runeerror	= 0xFFFD,	/* decoding error in UTF */
	Runemax		= 0xFFFF,	/* 16 bit rune */
};
I'm not sure whether this counts as a bug (the system works either way),
but it struck me as odd that the kernel and user space use different
rune sizes: 21 bit runes (UTFmax = 4) in libc.h, but 16 bit runes
(UTFmax = 3) in port/lib.h. To be clear: I have no technical knowledge
of UTF whatsoever, beyond "has more characters than ASCII". It seems
especially odd since vanilla Plan 9 consistently uses 21 bit runes
(although port/lib.h there calls them "24 bit rune[s]").
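To make the difference concrete, here is a minimal sketch of how the encoded length depends on the rune value. It is not taken from either header; utf8len is a hypothetical helper written for illustration (not libc's runelen), and the thresholds are the standard UTF-8 ranges.

```c
/*
 * Hypothetical helper for illustration only (not libc's runelen):
 * number of bytes needed to encode code point c in UTF-8.
 */
int
utf8len(unsigned long c)
{
	if(c < 0x80)
		return 1;	/* 7 bits: below Runeself, plain ASCII */
	if(c < 0x800)
		return 2;	/* up to 11 bits */
	if(c < 0x10000)
		return 3;	/* up to 16 bits: covers Runemax = 0xFFFF */
	return 4;		/* up to 21 bits: covers Runemax = 0x10FFFF */
}
```

Under these ranges utf8len(0xFFFF) is 3 and utf8len(0x10FFFF) is 4, which lines up with UTFmax = 3 for 16 bit runes and UTFmax = 4 for 21 bit runes.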
I tested the patch below and it seems to work (on 386 …).
- Maurice
diff -r 6b193fcbc781 sys/src/9/port/lib.h
--- a/sys/src/9/port/lib.h Tue Mar 08 16:45:29 2016 +0100
+++ b/sys/src/9/port/lib.h Wed Mar 09 23:28:35 2016 +0100
@@ -35,11 +35,12 @@
 enum
 {
-	UTFmax		= 3,		/* maximum bytes per rune */
-	Runesync	= 0x80,		/* cannot represent part of a UTF sequence */
-	Runeself	= 0x80,		/* rune and UTF sequences are the same (<) */
+	UTFmax		= 4,		/* maximum bytes per rune */
+	Runesync	= 0x80,		/* cannot represent part of a UTF sequence (<) */
+	Runeself	= 0x80,		/* rune and UTF sequences are the same (<) */
 	Runeerror	= 0xFFFD,	/* decoding error in UTF */
-	Runemax		= 0xFFFF,	/* 16 bit rune */
+	Runemax		= 0x10FFFF,	/* 21 bit rune */
+	Runemask	= 0x1FFFFF,	/* bits used by runes (see grep) */
 };
 /*