* [9fans] Why does utfutf() exist?
@ 2025-12-17 21:17 Jacob Moody
2025-12-18 9:53 ` Shawn Rutledge
0 siblings, 1 reply; 6+ messages in thread
From: Jacob Moody @ 2025-12-17 21:17 UTC
To: 9fans
I've been poking at some of the utf* functions lately and utfutf is a bit puzzling.
At face value, strstr() should be sufficient for handling utf8 encoded strings just as strcmp() is.
These functions have stayed largely the same since 9front imported them, so modifying them starts to drift
into "touching the artwork" territory. I wanted to think aloud (more like ramble) here and see if folks agree.
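As a minimal sketch of that claim, written in portable C rather than Plan 9 C (find_offset and the sample string are made up for illustration and are not part of any libc), strstr() finds a multi-byte needle purely by comparing bytes:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: byte offset of needle in hay, or -1 if absent. */
static long
find_offset(const char *hay, const char *needle)
{
	const char *p = strstr(hay, needle);

	return p == NULL ? -1 : (long)(p - hay);
}

/* "naive cafe" with accents, spelled as explicit UTF-8 bytes (C3 AF, C3 A9). */
static const char sample[] = "na\xc3\xaf" "ve caf\xc3\xa9";

static void
demo_strstr_utf8(void)
{
	assert(find_offset(sample, "\xc3\xa9") == 10);	/* e with acute */
	assert(find_offset(sample, "\xc3\xaf") == 2);	/* i with diaeresis */
	assert(find_offset(sample, "\xc3\xbc") == -1);	/* u with diaeresis is absent */
}
```

The offsets are byte offsets, not rune counts, which is also what the pointer returned by utfutf() gives you.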
So first, the implementation of utfutf itself to see if it's doing something different than strstr:
char*
utfutf(char *s1, char *s2)
{
	char *p;
	long f, n1, n2;
	Rune r;

	n1 = chartorune(&r, s2);
	f = r;
	if(f <= Runesync)		/* represents self */
		return strstr(s1, s2);
	n2 = strlen(s2);
	for(p=s1; p=utfrune(p, f); p+=n1)
		if(strncmp(p, s2, n2) == 0)
			return p;
	return 0;
}
We can see that in the case of a leading ASCII byte we do indeed just use strstr().
Note, however, that the check should be < rather than <=, as Runeself is 0x80 (the same value as Runesync).
If we do start with a multi-byte UTF-8 sequence, we take a normal strstr()-like approach but use utfrune() to find candidates.
So let's take a look at utfrune():
char*
utfrune(char *s, long c)
{
	long c1;
	Rune r;
	int n;

	if(c < Runesync)		/* not part of utf sequence */
		return strchr(s, c);
	for(;;) {
		c1 = *(uchar*)s;
		if(c1 < Runeself) {	/* one byte rune */
			if(c1 == 0)
				return 0;
			if(c1 == c)
				return s;
			s++;
			continue;
		}
		n = chartorune(&r, s);
		if(r == c)
			return s;
		s += n;
	}
}
So we can ignore the < Runesync case, since we won't hit that when called from utfutf().
What is left is a simple iteration that checks each decoded rune against the passed value.
So let's look at strstr() and see if there's a reason to avoid it:
char*
strstr(char *s1, char *s2)
{
	char *p, *pa, *pb;
	int c0, c;

	c0 = *s2;
	if(c0 == 0)
		return s1;
	s2++;
	for(p=strchr(s1, c0); p; p=strchr(p+1, c0)) {
		pa = p;
		for(pb=s2;; pb++) {
			c = *pb;
			if(c == 0)
				return p;
			if(c != *++pa)
				break;
		}
	}
	return 0;
}
By my reading nothing here breaks when dealing with UTF-8. It is not as efficient as it could be,
because each iteration calls strchr() with p+1, which means it has to skip through
the remaining bytes of the current sequence. Compared to calling chartorune() on each
non-ASCII character, though, I think it will still come out on top. (I would like to verify that.)
Reading the remaining bytes is safe because the beginning of a valid UTF-8 sequence can
never be confused with the middle of one, by definition.
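A small sketch of that property, in portable C with hypothetical helper names (is_continuation and hits_on_boundaries are not libc functions): continuation bytes always look like 10xxxxxx, so in a valid haystack a strstr() hit for a valid, non-empty needle cannot start in the middle of a sequence.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Continuation bytes have the form 10xxxxxx; ASCII and lead bytes never do. */
static int
is_continuation(unsigned char b)
{
	return (b & 0xC0) == 0x80;
}

/*
 * Hypothetical helper: returns 1 if every strstr() hit for needle in hay
 * starts on a sequence boundary, 0 otherwise.
 */
static int
hits_on_boundaries(const char *hay, const char *needle)
{
	const char *p;

	assert(needle[0] != '\0');	/* an empty needle matches everywhere */
	for(p = strstr(hay, needle); p != NULL; p = strstr(p + 1, needle))
		if(is_continuation((unsigned char)*p))
			return 0;
	return 1;
}
```

For example, hits_on_boundaries("\xc3\xbc" "x", "\xc3\xbc") is 1, while an invalid needle such as the lone continuation byte "\xbc" makes it return 0 for the same haystack.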
OK, so my next thought is that perhaps this is for handling invalid UTF-8 strings. So let's
walk through that.
In utfutf() we only decode the first rune of s2, so assuming that is invalid we get Runeerror.
We then call utfrune(), which does its own chartorune() and can return Runeerror for a different
invalid sequence of bytes. That seems too strange to be intentional, but I am unsure.
Additionally, this only happens for the first sequence; further sequences are compared as-is
by the strncmp() call, so again this doesn't seem intentional.
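To make that walk-through concrete, here is a rough sketch in portable C. decode(), scan_rune() and utfutf_sketch() are illustrative stand-ins for chartorune(), utfrune() and utfutf(), simplified to sequences of at most three bytes, and the memchr() guard is an addition that is not in the code quoted above. It is meant to show the shape of the behavior under those assumptions, not to reproduce libc exactly.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { Bad = 0xFFFD };	/* stand-in for Runeerror */

/*
 * Simplified stand-in for chartorune(): decodes one sequence of at most
 * three bytes into *r and returns the number of bytes consumed.
 * Malformed input yields Bad and consumes a single byte.
 */
static int
decode(unsigned long *r, const char *str)
{
	const unsigned char *s = (const unsigned char*)str;
	unsigned long c;

	if(s[0] < 0x80){	/* one byte rune */
		*r = s[0];
		return 1;
	}
	if((s[0] & 0xE0) == 0xC0 && (s[1] & 0xC0) == 0x80){
		c = ((s[0] & 0x1FUL) << 6) | (s[1] & 0x3FUL);
		if(c >= 0x80){	/* reject overlong forms */
			*r = c;
			return 2;
		}
	}else if((s[0] & 0xF0) == 0xE0 && (s[1] & 0xC0) == 0x80 && (s[2] & 0xC0) == 0x80){
		c = ((s[0] & 0x0FUL) << 12) | ((s[1] & 0x3FUL) << 6) | (s[2] & 0x3FUL);
		if(c >= 0x800){
			*r = c;
			return 3;
		}
	}
	*r = Bad;
	return 1;
}

/* Like the multi-byte path of utfrune(): first position whose rune equals c. */
static const char*
scan_rune(const char *s, unsigned long c)
{
	unsigned long r;
	int n;

	while(*s != '\0'){
		n = decode(&r, s);
		if(r == c)
			return s;
		s += n;
	}
	return NULL;
}

/* Like utfutf(): decode only the first rune of s2, then compare bytes. */
static const char*
utfutf_sketch(const char *s1, const char *s2)
{
	const char *p;
	unsigned long f;
	size_t n1, n2;

	assert(s1 != NULL && s2 != NULL);
	if((unsigned char)s2[0] < 0x80)	/* ASCII or empty: plain byte search */
		return strstr(s1, s2);
	n1 = (size_t)decode(&f, s2);
	n2 = strlen(s2);
	for(p = s1; (p = scan_rune(p, f)) != NULL; p += n1){
		if(strncmp(p, s2, n2) == 0)
			return p;
		if(memchr(p, '\0', n1) != NULL)	/* guard: fewer than n1 bytes left */
			return NULL;
	}
	return NULL;
}
```

Under these assumptions the bytes FE and FF both decode to the error rune, so a needle starting with FF makes the scan stop at FE as a candidate, and only the strncmp() rejects it. The sketch also suggests one case where the result differs from strstr(): with the needle "\xbc" (a lone continuation byte) and the haystack "\xc3\xbc", strstr() returns a pointer to the second byte while utfutf_sketch() returns NULL, because the scan decodes C3 BC as a single rune and never looks inside it.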
With that being said, I have a proposed cleanup of these two functions:
char*
utfrune(char *s, long c)
{
	Rune r;
	char buf[UTFmax + 1] = {0};

	if(c < Runesync)		/* not part of utf sequence */
		return strchr(s, c);
	r = c;
	runetochar(buf, &r);
	return strstr(s, buf);
}

/* might as well keep it for old code */
char*
utfutf(char *s1, char *s2)
{
	return strstr(s1, s2);
}
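For anyone who wants to try the idea outside Plan 9, here is a rough portable-C sketch of the same approach. encode() is a simplified stand-in for runetochar() (it does not reject surrogates), and find_rune() is a made-up name for the proposed utfrune(); neither is the real libc routine.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Simplified stand-in for runetochar(): writes the UTF-8 encoding of c
 * into buf and returns the number of bytes written (1 to 4).
 */
static int
encode(char *buf, unsigned long c)
{
	assert(c <= 0x10FFFF);
	if(c < 0x80){
		buf[0] = (char)c;
		return 1;
	}
	if(c < 0x800){
		buf[0] = (char)(0xC0 | (c >> 6));
		buf[1] = (char)(0x80 | (c & 0x3F));
		return 2;
	}
	if(c < 0x10000){
		buf[0] = (char)(0xE0 | (c >> 12));
		buf[1] = (char)(0x80 | ((c >> 6) & 0x3F));
		buf[2] = (char)(0x80 | (c & 0x3F));
		return 3;
	}
	buf[0] = (char)(0xF0 | (c >> 18));
	buf[1] = (char)(0x80 | ((c >> 12) & 0x3F));
	buf[2] = (char)(0x80 | ((c >> 6) & 0x3F));
	buf[3] = (char)(0x80 | (c & 0x3F));
	return 4;
}

/* Same idea as the proposed utfrune(): encode once, then search bytes. */
static const char*
find_rune(const char *s, unsigned long c)
{
	char buf[4 + 1] = {0};	/* longest encoding plus the terminator */

	if(c < 0x80)		/* single byte: strchr() is enough */
		return strchr(s, (int)c);
	encode(buf, c);
	return strstr(s, buf);
}
```

One difference that may be worth checking before committing: the old utfrune() compares decoded runes, so asking it for Runeerror appears to match any malformed byte, whereas an encode-then-strstr() version only matches the literal bytes EF BF BD.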
A quick grep of the 9front source tree shows no use of utfutf().
qwx pointed me to some other use cases in sources and around GitHub;
however, a cursory look showed that none of them were relying on behavior
that strstr() would not satisfy.
So I am asking here for any historical context, if there is some.
Thanks,
moody
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/T8831073f8b8bb351-M19c2dede80e2b8439fa4c68e
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
* Re: [9fans] Why does utfutf() exist?
From: Shawn Rutledge @ 2025-12-18 9:53 UTC
To: 9fans
> On Dec 17, 2025, at 22:17, Jacob Moody <moody@posixcafe.org> wrote:
>
> I've been poking at some of the utf* functions lately and utfutf is a bit puzzling.
> At face value, strstr() should be sufficient for handling utf8 encoded strings just as strcmp() is.
Maybe normalization could be the reason: there can be multiple representations, for example, ü might be one code point (Unicode: U+00FC, UTF-8: C3 BC), or might be u with a combining umlaut. I would assume converting to a rune would turn out the same either way: then you can compare them even if the haystack is represented one way in utf8 and the needle is the other way. (Disclaimer: I’m not a unicode expert, even less so on 9)
------------------------------------------
Permalink: https://9fans.topicbox.com/groups/9fans/T8831073f8b8bb351-Mcf1aad549b2989d69b4d6347
* Re: [9fans] Why does utfutf() exist?
From: quiekaizam via 9fans @ 2025-12-18 15:50 UTC
To: 9fans, Shawn Rutledge
> I would assume converting to a rune would turn out the same either way:
This sounds wrong to me. IIUC Runes are just Unicode code points. Glyphs may have multiple representations in Unicode, of which your ü is a good example. Mapping these representations together is a question of Unicode normalization, however, and involves lots of fiddly questions whose answers are specific to the particular use case. As such, conversion to Runes cannot reasonably perform normalization AFAIU.
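A minimal sketch of the point, in portable C (count_codepoints and the two sample strings are made up for illustration, and the counter assumes valid UTF-8): the two spellings of ü are different byte strings and different code point sequences, so neither a byte search nor a plain decode to runes treats them as equal.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Precomposed form: U+00FC, encoded as C3 BC. */
static const char composed[] = "\xc3\xbc";
/* Decomposed form: 'u' followed by U+0308 COMBINING DIAERESIS (CC 88). */
static const char decomposed[] = "u\xcc\x88";

/* Counts code points by counting bytes that are not continuation bytes. */
static size_t
count_codepoints(const char *s)
{
	size_t n = 0;

	for(; *s != '\0'; s++)
		if(((unsigned char)*s & 0xC0) != 0x80)
			n++;
	return n;
}

static void
demo_forms_differ(void)
{
	assert(count_codepoints(composed) == 1);
	assert(count_codepoints(decomposed) == 2);
	assert(strcmp(composed, decomposed) != 0);
	assert(strstr(decomposed, composed) == NULL);
}
```

Treating the two as the same glyph needs a normalization pass on both strings before any comparison.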
On 2025-12-18 18:53:35 JST, Shawn Rutledge <lists@ecloud.org> wrote:
>> On Dec 17, 2025, at 22:17, Jacob Moody <moody@posixcafe.org> wrote:
>>
>> I've been poking at some of the utf* functions lately and utfutf is a bit puzzling.
>> At face value, strstr() should be sufficient for handling utf8 encoded strings just as strcmp() is.
>
> Maybe normalization could be the reason: there can be multiple representations, for example, ü might be one code point (Unicode: U+00FC, UTF-8: C3 BC), or might be u with a combining umlaut. I would assume converting to a rune would turn out the same either way: then you can compare them even if the haystack is represented one way in utf8 and the needle is the other way. (Disclaimer: I’m not a unicode expert, even less so on 9)
>
------------------------------------------
Permalink: https://9fans.topicbox.com/groups/9fans/T8831073f8b8bb351-Mb71f0b6c34b98f89c7952434
* Re: [9fans] Why does utfutf() exist?
From: Jacob Moody @ 2025-12-18 17:13 UTC
To: 9fans
On 12/18/25 03:53, Shawn Rutledge wrote:
>> On Dec 17, 2025, at 22:17, Jacob Moody <moody@posixcafe.org> wrote:
>>
>> I've been poking at some of the utf* functions lately and utfutf is a bit puzzling.
>> At face value, strstr() should be sufficient for handling utf8 encoded strings just as strcmp() is.
>
> Maybe normalization could be the reason: there can be multiple representations, for example, ü might be one code point (Unicode: U+00FC, UTF-8: C3 BC), or might be u with a combining umlaut. I would assume converting to a rune would turn out the same either way: then you can compare them even if the haystack is represented one way in utf8 and the needle is the other way. (Disclaimer: I’m not a unicode expert, even less so on 9)
No, normalization is completely orthogonal to this.
First of all, when these were written Plan 9 did not handle detached codepoints or decomposed sequences at all, so I'd
find it quite surprising if the intention was to handle them here (or in chartorune).
Also, from a design standpoint, UTF decoding is not the correct place to implement normalization, for a large number
of reasons. To name a few:
1. Normalization requires the context of multiple codepoints. It would be quite complex for chartorune to do this, as by
the standard's definition a normalization context can technically be unbounded.
2. It would be quite surprising if your goal is to read in a file and write it back out, and codepoints were silently converted.
3. Normalization is not exactly cheap to perform, and chartorune is in the hot path of a lot of code.
4. One form is not inherently more correct than the other; the Unicode standard says you should treat composed and decomposed forms as equivalent.
If you want more context on normalization specifically, I wrote a paper about my normalization implementation for 9front that I presented at the last IWP9.
------------------------------------------
Permalink: https://9fans.topicbox.com/groups/9fans/T8831073f8b8bb351-M0acd2a42356729165fa7d00b
* Re: [9fans] Why does utfutf() exist?
From: Rob Pike @ 2025-12-18 20:16 UTC
To: 9fans
It was written for UTF (sic), not UTF-8, which was non-synchronizable and
therefore ambiguous so care needed to be taken when looking for byte
sequences. It stayed around after that. It may not be necessary, although
without checking I can't say whether it behaves the same as strstr when
illegal UTF-8 encodings occur.
-rob
------------------------------------------
Permalink: https://9fans.topicbox.com/groups/9fans/T8831073f8b8bb351-Mc0a8f3e30279a0099c29c8e3
* Re: [9fans] Why does utfutf() exist?
From: Jacob Moody @ 2025-12-18 20:48 UTC
To: 9fans
On 12/18/25 14:16, Rob Pike wrote:
> It was written for UTF (sic), not UTF-8, which was non-synchronizable and therefore ambiguous so care needed to be taken when looking for byte sequences. It stayed around after that. It may not be necessary, although without checking I can't say whether it behaves the same as strstr when illegal UTF-8 encodings occur.
>
Thanks! That makes sense.
------------------------------------------
Permalink: https://9fans.topicbox.com/groups/9fans/T8831073f8b8bb351-M65fc39da67f62ef7c3b081e6