9fans - fans of the OS Plan 9 from Bell Labs
* [9fans] utf-8 handling oddities
@ 2023-10-13 20:29 la-ninpre
  2023-10-14  4:56 ` LdBeth
  0 siblings, 1 reply; 2+ messages in thread
From: la-ninpre @ 2023-10-13 20:29 UTC (permalink / raw)
  To: 9fans

greetings, 9fans.

recently i have been studying the utf-8 encoding and decided to look at how it is handled in plan 9. since plan 9 was the first system to use this encoding, it made sense to look at its implementation. the fact that this implementation was written by the designers of the encoding themselves only reinforced that decision.

so i grabbed the last release tarball from p9f.org and studied it. but while testing some other implementations to compare how each handles encoding/decoding errors, i noticed that the same code linked with plan9port's lib9 behaves differently (or, may i say, incorrectly) from the original plan 9 implementation when dealing with surrogate halves. i started digging through archived versions of the same code, only to find out that the implementation changed only after the release of the fourth edition. specifically, i looked at the file /sys/src/libc/port/rune.c. the version i studied was taken from the so-called 'latest release' on the p9f page. the timestamp on that file says it was last modified in 2013, while the rest of the code is timestamped 2002. the inferno os source code also had this change ported to it around the same time.
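
to make the difference concrete, here is a minimal sketch of the two behaviours. it is not the code from rune.c: the function name decode_utf8, the constants and the reject_surrogates flag are all made up for illustration, and returning 1 on error is just one possible convention. with the flag off, a three-byte sequence that spells a surrogate half is decoded as if it were a valid code point; with it on, the decoder yields U+FFFD instead.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative constants; names are assumptions, not taken from rune.c. */
enum {
    RUNE_ERROR    = 0xFFFD, /* U+FFFD REPLACEMENT CHARACTER */
    SURROGATE_MIN = 0xD800,
    SURROGATE_MAX = 0xDFFF
};

/*
 * Decode one UTF-8 sequence from s (n bytes available) into *out.
 * Returns the number of bytes consumed. On malformed input, stores
 * RUNE_ERROR and consumes one byte (zero bytes if n == 0).
 * If reject_surrogates is nonzero, sequences that decode to
 * U+D800..U+DFFF are treated as malformed.
 */
int
decode_utf8(const unsigned char *s, size_t n, uint32_t *out,
            int reject_surrogates)
{
    uint32_t c, min;
    int len, i;

    if (n == 0) {
        *out = RUNE_ERROR;
        return 0;
    }
    c = s[0];
    if (c < 0x80) {                     /* single byte (ASCII) */
        *out = c;
        return 1;
    }
    if ((c & 0xE0) == 0xC0) {           /* 110xxxxx: two bytes */
        len = 2; c &= 0x1F; min = 0x80;
    } else if ((c & 0xF0) == 0xE0) {    /* 1110xxxx: three bytes */
        len = 3; c &= 0x0F; min = 0x800;
    } else if ((c & 0xF8) == 0xF0) {    /* 11110xxx: four bytes */
        len = 4; c &= 0x07; min = 0x10000;
    } else {
        goto bad;                       /* stray continuation or invalid lead */
    }
    if (n < (size_t)len)
        goto bad;                       /* truncated sequence */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            goto bad;                   /* not a continuation byte */
        c = (c << 6) | (s[i] & 0x3F);
    }
    if (c < min || c > 0x10FFFF)
        goto bad;                       /* overlong form or out of range */
    if (reject_surrogates && c >= SURROGATE_MIN && c <= SURROGATE_MAX)
        goto bad;                       /* surrogate half */
    *out = c;
    return len;

bad:
    *out = RUNE_ERROR;
    return 1;
}
```

for example, the bytes ED A0 80 decode to 0xD800 (three bytes consumed) when reject_surrogates is 0, and to 0xFFFD (one byte consumed) when it is 1.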

if i understand it correctly, unicode extended past the BMP in 1996 with the release of unicode 2.0. plan 9 had two editions released after that, but, assuming of course that the archives on p9f are indeed correct, the implementation didn't reflect the change in the code until 2013 (and that's why the old code propagated to both plan9port and 9front). does anyone know why that is the case? i'd appreciate any input on this, or pointers to any resources you may know of.

best regards,
la ninpre.

------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/T8384b8174eb88096-M127761f645d18b8419fc4f9b
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription

* Re: [9fans] utf-8 handling oddities
  2023-10-13 20:29 [9fans] utf-8 handling oddities la-ninpre
@ 2023-10-14  4:56 ` LdBeth
  0 siblings, 0 replies; 2+ messages in thread
From: LdBeth @ 2023-10-14  4:56 UTC (permalink / raw)
  To: 9fans

>>>>> In <1597A7B3-09D5-443F-B372-8B28F5F2B059@aaoth.xyz> 
>>>>>   la-ninpre <aaoth@aaoth.xyz> wrote:

la-ninpre> if i understand it correctly, unicode extended past the BMP
la-ninpre> in 1996 with the release of unicode 2.0. plan 9 had two
la-ninpre> editions released after that, but, assuming of course that
la-ninpre> the archives on p9f are indeed correct, the implementation
la-ninpre> didn't reflect the change in the code until 2013 (and
la-ninpre> that's why the old code propagated to both plan9port and
la-ninpre> 9front). does anyone know why that is the case? i'd
la-ninpre> appreciate any input on this, or pointers to any
la-ninpre> resources you may know of.

Fun fact, "the underlying Xerces parser used by most systems never
implemented XML 1.0 fifth edition" (which was released in 2008).

It is not uncommon for implementors to decide not to cover new features
that are of lesser interest to them.

Also, UTF-8 is **not required** to handle surrogates by the Unicode
standard, and Rob Pike has said in a relevant golang thread:

> It's correct to reject them

https://golang-dev.narkive.com/4Zves5rC/surrogate-halves-and-utf-8

which also explains the rationale of the Plan 9 code.
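
As an illustration of what rejecting them can look like on the encoding side, here is a minimal sketch. It is not taken from rune.c or from Go: the name encode_utf8 and the constants are hypothetical, and substituting U+FFFD for a rejected value is an assumption about what "reject" means in practice.

```c
#include <stdint.h>

/* Illustrative constants; names are assumptions, not from any real library. */
enum {
    RUNE_ERROR    = 0xFFFD, /* U+FFFD REPLACEMENT CHARACTER */
    SURROGATE_MIN = 0xD800,
    SURROGATE_MAX = 0xDFFF,
    RUNE_MAX      = 0x10FFFF
};

/*
 * Encode code point c as UTF-8 into buf (at least 4 bytes).
 * Returns the number of bytes written. Surrogate halves and values
 * above U+10FFFF are not encoded as such; U+FFFD is written instead.
 */
int
encode_utf8(uint32_t c, unsigned char *buf)
{
    if (c > RUNE_MAX || (c >= SURROGATE_MIN && c <= SURROGATE_MAX))
        c = RUNE_ERROR;

    if (c < 0x80) {                 /* 7 bits: one byte */
        buf[0] = (unsigned char)c;
        return 1;
    }
    if (c < 0x800) {                /* 11 bits: two bytes */
        buf[0] = (unsigned char)(0xC0 | (c >> 6));
        buf[1] = (unsigned char)(0x80 | (c & 0x3F));
        return 2;
    }
    if (c < 0x10000) {              /* 16 bits: three bytes */
        buf[0] = (unsigned char)(0xE0 | (c >> 12));
        buf[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
        buf[2] = (unsigned char)(0x80 | (c & 0x3F));
        return 3;
    }
    /* 21 bits: four bytes */
    buf[0] = (unsigned char)(0xF0 | (c >> 18));
    buf[1] = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
    buf[2] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
    buf[3] = (unsigned char)(0x80 | (c & 0x3F));
    return 4;
}
```

With this sketch, encode_utf8(0xD800, buf) writes EF BF BD (the encoding of U+FFFD) rather than ED A0 80, while U+20AC still comes out as E2 82 AC.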

la-ninpre> best regards,
la-ninpre> la ninpre.


---
ldbeth


------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/T8384b8174eb88096-M50e3a04b5272c6334c10d2af
