9fans - fans of the OS Plan 9 from Bell Labs
From: Jacob Moody <moody@posixcafe.org>
To: 9fans@9fans.net
Subject: Re: [9fans] Why does utfutf() exist?
Date: Thu, 18 Dec 2025 11:13:47 -0600
Message-ID: <cc164c59-1ed6-4db0-b012-1095e5e4993b@posixcafe.org>
In-Reply-To: <BCC16A4B-CD61-45EB-B8D0-277D9064BDC2@ecloud.org>

On 12/18/25 03:53, Shawn Rutledge wrote:
>> On Dec 17, 2025, at 22:17, Jacob Moody <moody@posixcafe.org> wrote:
>>
>> I've been poking at some of the utf* functions lately and utfutf is a bit puzzling.
>> At face value, strstr() should be sufficient for handling utf8 encoded strings just as strcmp() is.
> 
> Maybe normalization could be the reason: there can be multiple representations, for example, ü might be one code point (Unicode: U+00FC, UTF-8: C3 BC), or might be u with a combining umlaut.  I would assume converting to a rune would turn out the same either way: then you can compare them even if the haystack is represented one way in utf8 and the needle is the other way.  (Disclaimer: I’m not a unicode expert, even less so on 9)

No, normalization is completely orthogonal to this.
First of all, when these were written Plan 9 did not handle detached codepoints or decomposed sequences at all, so I'd
find it quite surprising if the intention was to handle them here (or in chartorune).
Also, from a design standpoint, your UTF decoding is not the correct place to implement normalization, for a large number
of reasons. To name a few:

1. Normalization requires the context of multiple codepoints. It would be quite complex for chartorune to do this, as by
the standard's definition a normalization context can technically be unbounded.
2. It would be quite surprising, if your goal is to read in a file and write it back out, to have codepoints silently converted.
3. Normalization is not exactly cheap to perform, and chartorune is in the hot path of a lot of code.
4. One form is not inherently more correct than the other; the Unicode standard says you should treat composed and decomposed forms as equivalent.
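
To make the "orthogonal" point concrete, here is a rough sketch in portable C (not Plan 9 C). The decode() helper is a hypothetical stand-in for chartorune, not the real thing: it assumes well-formed input of at most three bytes per codepoint and does no validation. It shows that the composed and decomposed spellings of ü differ at the byte level and still differ after decoding, so decoding alone cannot make them compare equal.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The two spellings of "ü" from the quoted message. */
const char composed[] = "\xC3\xBC";    /* U+00FC */
const char decomposed[] = "u\xCC\x88"; /* U+0075 followed by U+0308 */

/*
 * Hypothetical helper, NOT a Plan 9 API: decode one UTF-8 sequence of
 * up to three bytes and return the number of bytes consumed.
 * Assumes well-formed input; no validation is done.
 */
int
decode(const char *s, unsigned long *cp)
{
	const unsigned char *p = (const unsigned char *)s;

	if(p[0] < 0x80){
		*cp = p[0];
		return 1;
	}
	if((p[0] & 0xE0) == 0xC0){
		*cp = (unsigned long)(p[0] & 0x1F) << 6 | (p[1] & 0x3F);
		return 2;
	}
	*cp = (unsigned long)(p[0] & 0x0F) << 12
		| (unsigned long)(p[1] & 0x3F) << 6
		| (p[2] & 0x3F);
	return 3;
}

/* Checks that neither bytes nor decoded codepoints match across forms. */
void
demo(void)
{
	unsigned long a, b;

	/* Byte-wise search: neither form is found inside the other. */
	assert(strstr(decomposed, composed) == NULL);
	assert(strstr(composed, decomposed) == NULL);

	/* Decoding does not help: the first codepoints differ too. */
	decode(composed, &a);
	decode(decomposed, &b);
	assert(a == 0xFC);
	assert(b == 'u');
}
```

Calling demo() from a main() should pass every assertion. Matching the two forms would need a separate normalization pass over the decoded codepoints, which is the step argued above not to belong in the decoder.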

If you want more context on normalization specifically, I wrote a paper about my normalization implementation for 9front that I presented at the last IWP9.
