* [musl] Collation, IDN, and Unicode normalization
From: Rich Felker @ 2025-05-05 18:18 UTC
  To: musl

One aspect of LC_COLLATE support is that collation is supposed to be
invariant under canonical equivalence (that is, across different
normalization forms of the same text), while collation rules are best
expressed in terms of NFD.

The most direct way to apply collation rules is to transform the input
into NFD on the fly as the rules are applied. A more time-efficient and
code-simplifying way is to apply a "canonical closure" (defined in
UTN#5) to the collation rules ahead of time. The cost is larger
collation tables (how much larger is something I still need to
quantify). But the on-the-fly approach has a table size cost of its
own: it needs decomposition tables, plus the code and design work to
make those tables efficient.
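
To make the two options concrete, here is a toy sketch in C. Everything
in it is an assumption made for illustration: the table layouts, the
weights, and the names (decomp_tab, nfd_rules, match, weight_on_the_fly,
build_closure) are hypothetical and are not musl code or real locale
data. The closure shown only covers the simple case of a rule whose
two-code-point sequence has a precomposed form; the closure described in
UTN#5 covers every canonically equivalent spelling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy canonical decomposition table (hypothetical one-entry subset). */
struct decomp { uint32_t composed, base, mark; };
static const struct decomp decomp_tab[] = {
    { 0x00E5, 0x0061, 0x030A },  /* a with ring = a + COMBINING RING ABOVE */
};
#define NDECOMP (sizeof decomp_tab / sizeof decomp_tab[0])

/* Toy collation rules, written in terms of NFD sequences. */
struct rule { uint32_t seq[2]; size_t len; int weight; };
static const struct rule nfd_rules[] = {
    { { 0x0061, 0x030A }, 2, 300 },  /* made-up tailoring: sorts after z */
    { { 0x0061, 0 }, 1, 100 },       /* a */
    { { 0x007A, 0 }, 1, 200 },       /* z */
};
#define NRULES (sizeof nfd_rules / sizeof nfd_rules[0])

/* Longest-match weight of the first collation element of s, or -1. */
static int match(const struct rule *rules, size_t nrules,
                 const uint32_t *s, size_t n)
{
    int best = -1;
    size_t best_len = 0;
    for (size_t i = 0; i < nrules; i++) {
        const struct rule *r = &rules[i];
        if (r->len > n || r->len <= best_len) continue;
        size_t j = 0;
        while (j < r->len && r->seq[j] == s[j]) j++;
        if (j == r->len) { best = r->weight; best_len = r->len; }
    }
    return best;
}

/* Option 1: decompose the leading code point at lookup time, then
 * match against the NFD rules. Needs decomp_tab at run time. */
static int weight_on_the_fly(const uint32_t *s, size_t n)
{
    assert(s && n > 0);
    uint32_t buf[2];
    size_t len = 0;
    for (size_t i = 0; i < NDECOMP; i++) {
        if (decomp_tab[i].composed == s[0]) {
            buf[0] = decomp_tab[i].base;
            buf[1] = decomp_tab[i].mark;
            len = 2;
            break;
        }
    }
    if (!len) {
        buf[0] = s[0];
        len = 1;
        if (n > 1) { buf[1] = s[1]; len = 2; }
    }
    return match(nfd_rules, NRULES, buf, len);
}

/* Option 2: ahead of time, add a rule for each precomposed form of an
 * existing rule. The rule table grows, but lookups no longer need
 * decomp_tab. Returns the number of rules written to out. */
static size_t build_closure(struct rule *out, size_t cap)
{
    size_t n = 0;
    for (size_t i = 0; i < NRULES && n < cap; i++) out[n++] = nfd_rules[i];
    for (size_t i = 0; i < NRULES; i++) {
        if (nfd_rules[i].len != 2) continue;
        for (size_t d = 0; d < NDECOMP && n < cap; d++) {
            if (decomp_tab[d].base == nfd_rules[i].seq[0] &&
                decomp_tab[d].mark == nfd_rules[i].seq[1]) {
                struct rule r = { { decomp_tab[d].composed, 0 }, 1,
                                  nfd_rules[i].weight };
                out[n++] = r;
            }
        }
    }
    return n;
}
```

With the closed table, match() can be applied directly to input in
either form: U+00E5 and U+0061 U+030A both find a rule. With the
unclosed table, the precomposed form only matches after going through
weight_on_the_fly().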

Separately (and not part of the locale overhaul project), IDN support
requires the capability to perform normalization into NFKC -- maybe
not for all of Unicode, but at least for the characters that could
appear in domain names. So there may be some value in sharing the
[de]composition tables and using them in both directions.
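
As a rough picture of what "both directions" could mean, the sketch
below keeps one table of canonical pairs and reads it either way. It is
a toy under stated assumptions: the struct layout and the names (pairs,
decompose_one, compose_pair) are hypothetical, the table is a
three-entry subset, and the linear search stands in for whatever
indexing a real table would need. Full NFKC would additionally need
compatibility mappings, which only go one way, and the composition
exclusions, neither of which is modeled here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One row serves both directions. Toy subset, not real libc data. */
struct pair { uint32_t composed, first, second; };
static const struct pair pairs[] = {
    { 0x00C5, 0x0041, 0x030A },  /* A with ring above */
    { 0x00E5, 0x0061, 0x030A },  /* a with ring above */
    { 0x00E9, 0x0065, 0x0301 },  /* e with acute */
};
#define NPAIRS (sizeof pairs / sizeof pairs[0])

/* Decomposition direction: look up by composed code point.
 * Writes the result to out and returns its length (1 or 2). */
static size_t decompose_one(uint32_t c, uint32_t out[2])
{
    assert(out);
    for (size_t i = 0; i < NPAIRS; i++) {
        if (pairs[i].composed == c) {
            out[0] = pairs[i].first;
            out[1] = pairs[i].second;
            return 2;
        }
    }
    out[0] = c;  /* no decomposition: the code point maps to itself */
    return 1;
}

/* Composition direction: look up by (first, second).
 * Returns the composed code point, or 0 if the pair does not compose. */
static uint32_t compose_pair(uint32_t first, uint32_t second)
{
    for (size_t i = 0; i < NPAIRS; i++) {
        if (pairs[i].first == first && pairs[i].second == second)
            return pairs[i].composed;
    }
    return 0;
}
```

The two lookups want different keys, so a table sorted for one direction
is not directly searchable in the other; that is part of the design cost
of sharing.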

I know that for the very old version of Unicode supported in my uuterm,
the decomposition tables and code fit in under 8k.

I'm guessing the canonical closure for the collation data will be
a lot larger than that, even if Hangul could be special-cased and
elided. But depending on what level of collation capability we want
internal to libc, independent of having a locale definition loaded
(a loaded definition would be mmapped and fully shared), this size
might mainly be in locale files on disk, as opposed to decomposition
tables, which would be linked into libc.
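
On special-casing Hangul: the precomposed syllables U+AC00..U+D7A3
decompose arithmetically, using the constants given in the Unicode
Standard, so they need no table entries at all. A minimal sketch of that
arithmetic follows; the function name hangul_decompose and its calling
convention are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Constants from the Unicode Standard's Hangul syllable decomposition. */
enum {
    S_BASE = 0xAC00, L_BASE = 0x1100, V_BASE = 0x1161, T_BASE = 0x11A7,
    L_COUNT = 19, V_COUNT = 21, T_COUNT = 28,
    N_COUNT = V_COUNT * T_COUNT,  /* 588 */
    S_COUNT = L_COUNT * N_COUNT   /* 11172 */
};

/* Writes the full canonical decomposition of a precomposed Hangul
 * syllable to out and returns its length (2 or 3). Returns 0 and
 * leaves out untouched if c is not a precomposed syllable. */
static int hangul_decompose(uint32_t c, uint32_t out[3])
{
    assert(out);
    if (c < S_BASE || c >= S_BASE + S_COUNT) return 0;
    uint32_t s = c - S_BASE;
    out[0] = L_BASE + s / N_COUNT;              /* leading consonant */
    out[1] = V_BASE + (s % N_COUNT) / T_COUNT;  /* vowel */
    uint32_t t = s % T_COUNT;
    if (!t) return 2;                           /* no trailing consonant */
    out[2] = T_BASE + t;                        /* trailing consonant */
    return 3;
}
```

For example, U+D55C yields U+1112 U+1161 U+11AB, and U+AC00 yields
U+1100 U+1161 with no trailing consonant.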

I'll be trying to work out some quantitative data on the tradeoffs
here, but wanted to go ahead and put the topic out there, especially
since IDN has come up on IRC again recently and a good choice here
might intersect with the IDN work.

Rich
