9front - general discussion about 9front
From: Maurice Quennet <maurice@quennet.eu>
To: 9front@9front.org
Subject: UTF kernel/user space discrepancy
Date: Thu, 10 Mar 2016 18:43:35 +0100
Message-ID: <56E1B247.8000700@quennet.eu>

Hi,

while reading some source code I discovered that /sys/include/libc.h defines

enum
{
         UTFmax          = 4,            /* maximum bytes per rune */
         Runesync        = 0x80,         /* cannot represent part of a UTF sequence (<) */
         Runeself        = 0x80,         /* rune and UTF sequences are the same (<) */
         Runeerror       = 0xFFFD,       /* decoding error in UTF */
         Runemax         = 0x10FFFF,     /* 21 bit rune */
         Runemask        = 0x1FFFFF,     /* bits used by runes (see grep) */
};

whereas /sys/src/9/port/lib.h defines

enum
{
         UTFmax          = 3,    /* maximum bytes per rune */
         Runesync        = 0x80, /* cannot represent part of a UTF sequence */
         Runeself        = 0x80, /* rune and UTF sequences are the same (<) */
         Runeerror       = 0xFFFD,       /* decoding error in UTF */
         Runemax         = 0xFFFF,       /* 16 bit rune */
};

I'm not sure whether this counts as a bug (the system works either way),
but it struck me as odd that the kernel and user space use different
rune sizes: 16 bit runes in the kernel, 21 bit runes in user space. (To
be clear: I have no technical knowledge of UTF beyond "has more
characters than ASCII".) It is especially odd since vanilla Plan 9
consistently uses 21 bit runes (although they call it "24 bit rune[s]"
in port/lib.h).
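
To illustrate what the two limits mean in practice, here is a minimal
sketch in portable C (not the actual Plan 9 runetochar). The names
encode, Rune32, KernRunemax and UserRunemax are made up for this
example, and it assumes that a rune above the limit is replaced by
Runeerror before encoding, which is how I understand the library to
behave. It also skips surrogate handling.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long Rune32;	/* hypothetical; not the Plan 9 Rune type */

enum
{
	Runeerror	= 0xFFFD,	/* decoding error in UTF */
	KernRunemax	= 0xFFFF,	/* Runemax from port/lib.h (16 bit) */
	UserRunemax	= 0x10FFFF,	/* Runemax from libc.h (21 bit) */
};

/*
 * Encode r as UTF-8 into buf, which must hold at least 4 bytes.
 * Returns the number of bytes written.  Runes above runemax are
 * replaced by Runeerror first (assumed behaviour, see above).
 */
int
encode(unsigned char *buf, Rune32 r, Rune32 runemax)
{
	assert(buf != NULL);
	if(r > runemax)
		r = Runeerror;
	if(r < 0x80){			/* 1 byte: 0xxxxxxx */
		buf[0] = (unsigned char)r;
		return 1;
	}
	if(r < 0x800){			/* 2 bytes: 110xxxxx 10xxxxxx */
		buf[0] = (unsigned char)(0xC0 | (r >> 6));
		buf[1] = (unsigned char)(0x80 | (r & 0x3F));
		return 2;
	}
	if(r < 0x10000){		/* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
		buf[0] = (unsigned char)(0xE0 | (r >> 12));
		buf[1] = (unsigned char)(0x80 | ((r >> 6) & 0x3F));
		buf[2] = (unsigned char)(0x80 | (r & 0x3F));
		return 3;
	}
	/* 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
	buf[0] = (unsigned char)(0xF0 | ((r >> 18) & 0x07));
	buf[1] = (unsigned char)(0x80 | ((r >> 12) & 0x3F));
	buf[2] = (unsigned char)(0x80 | ((r >> 6) & 0x3F));
	buf[3] = (unsigned char)(0x80 | (r & 0x3F));
	return 4;
}
```

With the user space limit, encode(buf, 0x1F600, UserRunemax) writes the
four bytes F0 9F 98 80. With the kernel limit, the same rune exceeds
Runemax, so encode(buf, 0x1F600, KernRunemax) writes the three bytes of
Runeerror (EF BF BD) instead. This is why UTFmax = 3 is enough for 16
bit runes but 21 bit runes need UTFmax = 4.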

I tested the patch below and it seems to work (on 386 …).

- Maurice


diff -r 6b193fcbc781 sys/src/9/port/lib.h
--- a/sys/src/9/port/lib.h	Tue Mar 08 16:45:29 2016 +0100
+++ b/sys/src/9/port/lib.h	Wed Mar 09 23:28:35 2016 +0100
@@ -35,11 +35,12 @@

 enum
 {
-	UTFmax		= 3,	/* maximum bytes per rune */
-	Runesync	= 0x80,	/* cannot represent part of a UTF sequence */
-	Runeself	= 0x80,	/* rune and UTF sequences are the same (<) */
+	UTFmax		= 4,		/* maximum bytes per rune */
+	Runesync	= 0x80,		/* cannot represent part of a UTF sequence (<) */
+	Runeself	= 0x80,		/* rune and UTF sequences are the same (<) */
 	Runeerror	= 0xFFFD,	/* decoding error in UTF */
-	Runemax		= 0xFFFF,	/* 16 bit rune */
+	Runemax		= 0x10FFFF,	/* 21 bit rune */
+	Runemask	= 0x1FFFFF,	/* bits used by runes (see grep) */
 };

 /*



Thread overview: 2+ messages
2016-03-10 17:43 Maurice Quennet [this message]
2016-03-10 18:55 ` [9front] " cinap_lenrek
