9fans - fans of the OS Plan 9 from Bell Labs
* [9fans] Why does utfutf() exist?
From: Jacob Moody @ 2025-12-17 21:17 UTC
  To: 9fans

I've been poking at some of the utf* functions lately, and utfutf() is a bit puzzling.
At face value, strstr() should be sufficient for handling UTF-8 encoded strings, just as strcmp() is.
These functions have largely been the same since 9front imported them, so modifying them starts to drift
into "touching the artwork" territory. I wanted to think aloud (more like ramble) here and see if folks agree.


So first, the implementation of utfutf() itself, to see whether it does anything different from strstr():

char*
utfutf(char *s1, char *s2)
{
        char *p;
        long f, n1, n2;
        Rune r;

        n1 = chartorune(&r, s2);
        f = r;
        if(f <= Runesync)               /* represents self */
                return strstr(s1, s2);

        n2 = strlen(s2);
        for(p=s1; p=utfrune(p, f); p+=n1)
                if(strncmp(p, s2, n2) == 0)
                        return p;
        return 0;
}

In the case of a leading ASCII byte, we do indeed just use strstr().
However, note that the check should be < rather than <=, as Runeself is 0x80 (Runesync has the same value).
If s2 starts with a multi-byte UTF-8 sequence, we take a normal strstr()-like approach but use utfrune().
So let's take a look at utfrune():

char*
utfrune(char *s, long c)
{
        long c1;
        Rune r;
        int n;

        if(c < Runesync)                /* not part of utf sequence */
                return strchr(s, c);

        for(;;) {
                c1 = *(uchar*)s;
                if(c1 < Runeself) {     /* one byte rune */
                        if(c1 == 0)
                                return 0;
                        if(c1 == c)
                                return s;
                        s++;
                        continue;
                }
                n = chartorune(&r, s);
                if(r == c)
                        return s;
                s += n;
        }
}

We can ignore the < Runesync case, since utfutf() never reaches it.
What is left is a simple iteration that checks each rune against the passed value.
So let's look at strstr() and see if there's a reason to avoid it:

char*
strstr(char *s1, char *s2)
{
        char *p, *pa, *pb;
        int c0, c;

        c0 = *s2;
        if(c0 == 0)
                return s1;
        s2++;
        for(p=strchr(s1, c0); p; p=strchr(p+1, c0)) {
                pa = p;
                for(pb=s2;; pb++) {
                        c = *pb;
                        if(c == 0)
                                return p;
                        if(c != *++pa)
                                break;
                }
        }
        return 0;
}

By my reading, nothing here breaks when dealing with UTF-8. It is not as efficient as it could be,
because each iteration calls strchr() with p+1, which means skipping through the remaining
bytes of the current sequence. But compared to calling chartorune() on each
non-ASCII character, I think it will still come out on top. (I would like to verify that, though.)
Scanning the remaining bytes is safe because the beginning of a valid UTF-8 sequence can
never be confused with the middle of one, by definition.
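
To make the boundary argument concrete, here is a minimal sketch in portable C
(not Plan 9 C). The names isrunestart() and findutf() are made up for
illustration; the only real work is done by strstr().

```c
#include <stddef.h>
#include <string.h>

/* A byte begins a rune unless it has the continuation form 10xxxxxx. */
int
isrunestart(unsigned char b)
{
	return (b & 0xC0) != 0x80;
}

/*
 * Hypothetical helper: byte-wise search that reports the offset of
 * needle in hay, or -1 if it is absent. Nothing here decodes UTF-8.
 */
long
findutf(const char *hay, const char *needle)
{
	const char *p = strstr(hay, needle);

	return p == NULL ? -1 : (long)(p - hay);
}
```

If both strings are valid UTF-8, any offset findutf() reports should satisfy
isrunestart(), since the needle's lead byte cannot match a continuation byte.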

OK, so my thought here is that perhaps this is for handling invalid UTF-8 strings. So let's
walk through that.

In utfutf(), we only decode the first rune of s2. Assuming that is invalid, we get Runeerror.
We then call utfrune(), which does its own chartorune() and can return Runeerror for a different
invalid sequence of bytes. That seems too strange to be intentional, but I am unsure.
Additionally, this only happens for the first sequence; further sequences are compared as-is
by the strncmp() call, so again this doesn't seem intentional.
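
A rough sketch of that mismatch, in portable C. decode() is a simplified
stand-in for chartorune() that I am assuming for illustration (1- to 3-byte
sequences only, no surrogate checks); it is not the real libc decoder, and
samerune() is likewise hypothetical.

```c
#include <string.h>

enum { RUNEERROR = 0xFFFD };	/* mirrors Plan 9's Runeerror */

/*
 * Simplified stand-in for chartorune(): anything that is not a
 * well-formed 1- to 3-byte sequence decodes to RUNEERROR, length 1.
 */
int
decode(unsigned long *r, const char *s)
{
	const unsigned char *p = (const unsigned char*)s;

	if(p[0] < 0x80){
		*r = p[0];
		return 1;
	}
	if((p[0] & 0xE0) == 0xC0 && (p[1] & 0xC0) == 0x80){
		*r = (unsigned long)(p[0] & 0x1F) << 6 | (p[1] & 0x3F);
		if(*r >= 0x80)		/* reject overlong forms */
			return 2;
	}else if((p[0] & 0xF0) == 0xE0 && (p[1] & 0xC0) == 0x80
	      && (p[2] & 0xC0) == 0x80){
		*r = (unsigned long)(p[0] & 0x0F) << 12
			| (unsigned long)(p[1] & 0x3F) << 6 | (p[2] & 0x3F);
		if(*r >= 0x800)		/* reject overlong forms */
			return 3;
	}
	*r = RUNEERROR;
	return 1;
}

/* Rune-level comparison, as the utfrune() loop effectively does. */
int
samerune(const char *a, const char *b)
{
	unsigned long ra, rb;

	decode(&ra, a);
	decode(&rb, b);
	return ra == rb;
}
```

With this sketch, the stray bytes "\xff" and "\x80" compare equal at the rune
level, because both decode to RUNEERROR, yet a byte-wise strncmp() of the same
two strings reports a difference.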

With that being said, I have a proposed cleanup of these two functions:

char*
utfrune(char *s, long c)
{
        Rune r;
        char buf[UTFmax + 1] = {0};

        if(c < Runesync)                /* not part of utf sequence */
                return strchr(s, c);

        r = c;
        runetochar(buf, &r);
        return strstr(s, buf);
}

/* might as well keep it for old code */
char*
utfutf(char *s1, char *s2)
{
        return strstr(s1, s2);
}
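
For anyone who wants to try the idea outside Plan 9, here is a portable sketch
of the same approach. encode() is an assumed stand-in for runetochar() (it does
no range or surrogate checking), and findrune() and UTFMAX are names invented
for this example.

```c
#include <string.h>

enum { UTFMAX = 4 };	/* assumed maximum bytes per rune */

/* Stand-in for runetochar(): writes the UTF-8 form of r into buf. */
int
encode(char *buf, unsigned long r)
{
	if(r < 0x80){
		buf[0] = (char)r;
		return 1;
	}
	if(r < 0x800){
		buf[0] = (char)(0xC0 | (r >> 6));
		buf[1] = (char)(0x80 | (r & 0x3F));
		return 2;
	}
	if(r < 0x10000){
		buf[0] = (char)(0xE0 | (r >> 12));
		buf[1] = (char)(0x80 | ((r >> 6) & 0x3F));
		buf[2] = (char)(0x80 | (r & 0x3F));
		return 3;
	}
	buf[0] = (char)(0xF0 | (r >> 18));
	buf[1] = (char)(0x80 | ((r >> 12) & 0x3F));
	buf[2] = (char)(0x80 | ((r >> 6) & 0x3F));
	buf[3] = (char)(0x80 | (r & 0x3F));
	return 4;
}

/* Find rune c in s by encoding it once and searching byte-wise. */
const char*
findrune(const char *s, unsigned long c)
{
	char buf[UTFMAX + 1] = {0};	/* zeroed, so always NUL-terminated */

	if(c < 0x80)
		return strchr(s, (int)c);
	encode(buf, c);
	return strstr(s, buf);
}
```

One behavioral difference worth keeping in mind: searching for Runeerror this
way only matches the literal bytes EF BF BD, whereas the decoding loop matches
any invalid sequence.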

A quick grep of the 9front source tree shows no use of utfutf().
qwx pointed me to some other use cases in sources and around GitHub;
however, a cursory look showed that none of them were relying on behavior
that strstr() would not satisfy.

So I am asking here for any historical context, if there is some.


Thanks,
moody


------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/T8831073f8b8bb351-M19c2dede80e2b8439fa4c68e
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription


Thread overview: 6+ messages
2025-12-17 21:17 [9fans] Why does utfutf() exist? Jacob Moody
2025-12-18  9:53 ` Shawn Rutledge
2025-12-18 15:50   ` quiekaizam via 9fans
2025-12-18 17:13   ` Jacob Moody
2025-12-18 20:16     ` Rob Pike
2025-12-18 20:48       ` Jacob Moody
