* [musl] Collation, IDN, and Unicode normalization
From: Rich Felker @ 2025-05-05 18:18 UTC (permalink / raw)
To: musl
One aspect of LC_COLLATE support is that collation is supposed to be
invariant under canonical equivalence/different normalization forms,
while collation rules are best expressed in terms of NFD.
The most direct and simple way to apply collation rules is to
transform input into NFD (on the fly) as the rules are applied. A
more time-efficient and code-simplifying way is to apply a "canonical
closure" (defined in UTN#5) to the collation rules ahead of time, so
that every canonically equivalent spelling of a rule's key gets its
own entry (e.g. precomposed 00C8 alongside <0045 0300>). The cost is
making the collation tables larger (how much larger is something I
still need to quantify); without this approach, though, there is a
table size cost (as well as code and design work to make that table
efficient) to be able to compute decompositions on the fly.
Separately (and not part of the locale overhaul project), IDN support
requires the capability to perform normalization into NFKC -- maybe
not for all of Unicode, but at least for the characters that could
appear in domain names. So in theory there may be some value in
sharing the [de]composition tables and using them in both directions.
I know that, for the very old version of Unicode supported in my
uuterm, the decomposition tables and code fit in under 8k.
I'm guessing the canonical closure for the collation data will be
a lot larger than that, even if Hangul could be special-cased and
elided. But depending on what level of collation capability we want
internal to libc, independent of having a locale definition loaded
(which would be fully-shared mmapped), this size might mainly be in
locale files on disk, as opposed to decomposition tables which would
be linked into libc.
I'll be trying to work out some quantitative data on the tradeoffs
here, but wanted to go ahead and put the topic out there, especially
since the IDN topic has come up on IRC again recently, and a good
choice here might intersect with the IDN work.
Rich
* Re: [musl] Collation, IDN, and Unicode normalization
From: Rich Felker @ 2025-05-29 0:29 UTC (permalink / raw)
To: musl
On Wed, May 28, 2025 at 02:47:43AM -0400, Rich Felker wrote:
> On Mon, May 05, 2025 at 02:18:49PM -0400, Rich Felker wrote:
> > One aspect of LC_COLLATE support is that collation is supposed to be
> > invariant under canonical equivalence/different normalization forms,
> > while collation rules are best expressed in terms of NFD.
> >
> > The most direct and simple way to apply collation rules is to
> > transform input into NFD (on the fly) as the rules are applied. A
> > more time-efficient and code-simplifying way is to apply a "canonical
> > closure" (defined in UTN#5) to the collation rules ahead of time, so
> > that every canonically equivalent spelling of a rule's key gets its
> > own entry (e.g. precomposed 00C8 alongside <0045 0300>). The cost is
> > making the collation tables larger (how much larger is something I
> > still need to quantify); without this approach, though, there is a
> > table size cost (as well as code and design work to make that table
> > efficient) to be able to compute decompositions on the fly.
>
> It turns out a good deal of the decomposition/normalization logic is
> needed even if you do the canonical closure approach, since you still
> need to deal with non-starters (characters with canonical combining
> class != 0) which may be reordered under normalization. For example,
> given:
>
> 00C8 = 0045 0300 (E-grave)
> 1E1A = 0045 0330 (E-tilde_below)
>
> 00C8 0330 is equivalent to 1E1A 0300, and both must be treated as 0045
> 0330 0300 for the purpose of applying the collation rules (because
> 0330 has lower canonical combining class than 0300).
>
> One could imagine expanding out a closure for all possible sequences
> that could reorder to a given collation element, but this quickly
> blows up with all the non-meaningful-but-allowed sequences of
> combining marks, and AFAICT the set isn't even finite. This means it
> always requires some sort of fallback to an iterator that runs over
> a sequence of a starter followed by a number of non-starters and reads
> off the decomposed characters in order of canonical combining class,
> without actually having working space to decode them. This will be an
> important ingredient of the implementation, and it's a bit tricky to
> do right, but tractable.
OK, so I've worked out the iterator system and it's not all that bad.
I have untested code, but it depends on a backend for the
decomposition & canonical combining class tables which I haven't done
yet.
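To sketch the shape of it (not the actual code; decomp() and ccc()
here are hypothetical stand-ins for the missing table backend), the
reordering can be done with O(1) working space by repeated scans over
the run, emitting characters in ascending combining-class order:

#include <stddef.h>

/* Hypothetical backend: ccc() returns the canonical combining class;
 * decomp() writes the full canonical decomposition (at most 4
 * characters, or the character itself) and returns its length. */
unsigned char ccc(unsigned wc);
size_t decomp(unsigned wc, unsigned out[4]);

/* Emit the canonically ordered decomposition of seq[0..n-1], a
 * starter followed by non-starters. Class-0 output can only come
 * from the starter's own decomposition; the rest is ordered stably
 * by class via one scan per class value. Simplified: ignores the
 * rare decompositions containing non-initial starters, which act as
 * reordering barriers. */
void ordered_decomp(const unsigned *seq, size_t n, void (*emit)(unsigned))
{
	unsigned buf[4];
	size_t m = decomp(seq[0], buf), k = 0;
	while (k < m && !ccc(buf[k])) emit(buf[k++]);
	for (int cls = 1; cls < 256; cls++) {
		for (size_t j = k; j < m; j++)
			if (ccc(buf[j]) == cls) emit(buf[j]);
		for (size_t i = 1; i < n; i++) {
			unsigned b[4];
			size_t l = decomp(seq[i], b);
			for (size_t j = 0; j < l; j++)
				if (ccc(b[j]) == cls) emit(b[j]);
		}
	}
}

On the example above, both 00C8 0330 and 1E1A 0300 come out as 0045
0330 0300.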
Regarding that, I have a rather unconventional idea for combining the
data tables that I'd like to put forward.
First, some numbers.
There are 1046 characters in current Unicode that decompose, most of
them just to a pair of characters, but a handful to 3 or 4. (This is
not counting precomposed Hangul syllables, which decompose into Jamo
algorithmically and don't need any tables.)
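For reference, the standard Hangul decomposition arithmetic is tiny;
a minimal sketch using the constants from UAX #15:

/* Hangul syllables decompose as S = 0xAC00 + L*588 + V*28 + T,
 * so the pieces fall out by division; no tables involved. */
static int hangul_decomp(unsigned s, unsigned out[3])
{
	if (s-0xAC00u >= 11172) return 0;
	unsigned i = s - 0xAC00;
	int n = 0;
	out[n++] = 0x1100 + i/588;          /* leading consonant (L) */
	out[n++] = 0x1161 + i%588/28;       /* vowel (V) */
	if (i%28) out[n++] = 0x11A7 + i%28; /* trailing consonant (T) */
	return n;
}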
There are 934 characters with nonzero canonical combining class.
These are called "non-starters". Each has an 8-bit combining class
code associated with it; only 55 of the 255 possible nonzero values
actually appear.
The (pardon the pun) canonical way to represent these would be a
multilevel "is_decomposing" table to identify characters that
decompose, plus a sorted list of decompositions to binary-search when
one is hit; and a similar multilevel table for combining classes, but
one that maps to the class number, much as the case mapping table
maps to a smallish set of case-mapping rules.
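In code, that conventional scheme would look something like this
(types and packing illustrative only):

#include <stddef.h>

/* Sorted list of decomposable characters; off indexes into a flat
 * array holding the expansions. */
struct decomp_ent { unsigned cp; unsigned short off, len; };

static const struct decomp_ent *find_decomp(unsigned wc,
	const struct decomp_ent *tab, size_t n)
{
	size_t lo = 0, hi = n;
	while (lo < hi) {
		size_t mid = lo + (hi-lo)/2;
		if (tab[mid].cp < wc) lo = mid+1;
		else hi = mid;
	}
	return lo < n && tab[lo].cp == wc ? tab+lo : 0;
}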
However, while decomposable characters and non-starters are separate
things, for practical purposes they go hand-in-hand. They are the
characters that need special treatment when normalizing, and you have
to process both without knowing ahead of time which you're looking
at. So what I think may make sense is combining them into a single
table that maps "character needing normalization processing" to a
sequence (possibly of length 1, if the character is a non-starter
rather than an actual decomposable character) of *pairs* of the form
[character, canonical combining class]. That way, you get the
canonical combining classes you need at the same time as doing the
decomposition, rather than needing a separate lookup for each
character that comes out of the decomposition.
The cost of storing the extra bits for the combining class is
virtually nothing, since (1) it saves having a separate parallel data
structure, and (2) Unicode characters are 21-bit (17-bit in practice
for the relevant range), leaving well over 8 bits that are simply
wasted unless you use costly-to-unpack packings.
Basically, the motivation is this: the bits are there unused, so we
might as well use them to store the data we're going to need right
afterwards anyway; and given that, it makes sense to use the same
representation for non-starters.
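As a sketch of what such a packed slot could look like (names and
exact bit assignments hypothetical):

/* One 32-bit slot per element of an expansion: 17 bits of
 * codepoint, 8 bits of canonical combining class, spare high bits
 * for flags such as end-of-sequence. */
#define SLOT(cp,cc,last) ((unsigned)(cp) | (unsigned)(cc)<<17 | (unsigned)(last)<<31)
#define SLOT_CP(s)	((s) & 0x1ffff)
#define SLOT_CCC(s)	((s)>>17 & 0xff)
#define SLOT_LAST(s)	((s)>>31)

A non-starter like 0301 (class 230) would then be the single slot
SLOT(0x301,230,1), and 00C8 would be SLOT(0x45,0,0) followed by
SLOT(0x300,230,1).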
I'm not sure what the ideal format for the actual table is. I was
initially looking at a multi-level bit table to evaluate the
predicate "needs normalization processing" (same as the isw*() tables
in musl now), then binary-searching a sorted list. However, a direct
multi-level table mapping to the expansions might be an option as
well. There are currently 78 256-character-aligned blocks containing
relevant characters. That puts a bit table at about 2.5k, and a full
map at about 20k, versus something like 7-10k (depending on how nasty
the encoding is) for the binary-searchable list of 1980 slots, each
holding a 17-bit character number and an offset to the expansion.
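(For the arithmetic behind those estimates: 78 * 32 = 2496 bytes of
bitmap; 78 * 256 = 19968 bytes of byte-indexed map; 1980 * 4 = 7920
bytes for the list entries themselves, before index and encoding
overhead.)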
* Re: [musl] Collation, IDN, and Unicode normalization
From: Nick Wellnhofer @ 2025-05-29 9:45 UTC (permalink / raw)
To: musl
On May 29, 2025, at 04:37, Rich Felker <dalias@libc.org> wrote:
> Top-level table (indexed by codepoint>>8) to select a table: 1 byte
> per entry, for 512 bytes.
>
> Second-level tables (indexed by codepoint&255):
You could also try different bit shifts that might yield smaller
tables. Another option to compress the data further is to use
third-level tables. I have some old code somewhere that brute-forces
all combinations of shift values to find the smallest tables.
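For what it's worth, the brute force is only a few lines to redo. A
sketch, assuming a flat property array whose length is a multiple of
the block size, with 16-bit index entries:

#include <string.h>
#include <stddef.h>

/* Size of a two-level table over prop[0..n-1] for a given shift:
 * one 16-bit index entry per block, plus each distinct block stored
 * once. Loop over all shifts and keep the minimum. */
static size_t two_level_size(const unsigned char *prop, size_t n, int shift)
{
	size_t bs = (size_t)1 << shift, nblk = n/bs, distinct = 0;
	for (size_t i = 0; i < nblk; i++) {
		size_t j;
		for (j = 0; j < i; j++)
			if (!memcmp(prop+i*bs, prop+j*bs, bs)) break;
		if (j == i) distinct++;
	}
	return 2*nblk + distinct*bs;
}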
Nick
* Re: [musl] Collation, IDN, and Unicode normalization
From: Rich Felker @ 2025-05-29 16:52 UTC (permalink / raw)
To: Nick Wellnhofer; +Cc: musl
On Thu, May 29, 2025 at 10:03:57AM -0400, Rich Felker wrote:
> down to. If you take out the actual expansions, which are 9580 bytes,
> it's a better comparison. Those could be packed tighter to get it down
> further, but improving them is independent of the way we map to them.
OK, those can be cut down a lot.
The 9580 figure was from 2395 slots of 32 bits each (2395 * 4 bytes),
each holding a 17-bit codepoint and an 8-bit combining class.
However, a quick analysis shows there are only 94 characters that
ever appear in a non-initial position of an expansion. This means the
actual expansions can be encoded as a potentially larger initial
member followed by 7 bits (plus 1 continuation-flag bit?) for each
subsequent character in the expansion.
That requires a 94-entry table to map the possibilities to actual
character numbers and combining classes, but the table can easily be
24-bit packed rather than 32-bit if we want, at 282 bytes (94 * 3),
or just aligned 32-bit at 376 bytes (94 * 4).
There are 317 possible characters that appear in the first slot of an
expansion, and some of these overlap with the 94, since only 406
distinct characters appear in any position.
We could put all of those 406 in a table, with the 94 at the beginning
so they can be referenced by a single byte.
For non-starters which expand to themselves, we can have an
additional 55 slots, one for each combining class that appears, with
a dummy character value meaning "expands to itself" and only the
combining class value being used. That increases the number of
entries to 461, i.e. 1383 or 1844 bytes (461 * 3 or 461 * 4)
depending on whether we pack to 3-byte entries.
Now, what about the references to the initial character of each
expansion? Do we make them 2-byte? How about instead making all of
the references a variable-length code: a byte x<128 represents
itself; a byte x>=128 represents (x-128)*128+y, where y is the next
byte. This future-proofs things by allowing the number of characters
that could appear to be as high as 16384, and lets us use a single
byte for a lot of the leading characters of expansions too.
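Decoding such a reference is then just a couple of lines (sketch; the
x<240 variant below differs only in the split point):

/* Read one variable-length reference, advancing *p. A byte below
 * 128 stands for itself; otherwise it combines with the next byte,
 * covering values up to 16383. */
static unsigned ref_get(const unsigned char **p)
{
	unsigned x = *(*p)++;
	if (x < 128) return x;
	return (x-128)*128 + *(*p)++;
}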
Or, since we're not going to need anywhere near that many, we could
reserve only a few bits for the variable-length coding. For example,
if x<240 represents itself, all of the 94 non-initial decomposition
characters and all of the 55 non-starter class codes fit in a single
byte, and we can still represent everything else in the range of
(x-240)*240+y, which reaches up to 3839 (over 8x the presently needed
range).
Now, let's estimate sizes again: if there are 2395 slots, 1046 of
them initial, let's assume the initial ones take 2 bytes each (some
can be less) in the expansion table and the rest take 1 byte each.
That's 1349 + 2*1046 = 3441 bytes for expansions. Adding the above
1383 bytes for the table of possible characters gives 4824 bytes --
basically half of what we started at.
One final note: I've kinda handwaved being able to pack the table of
possible characters into 3 bytes each rather than 4, since its
entries are 17-bit characters with 8-bit combining classes. However,
among the non-BMP characters, only 9 of the possible 55 classes ever
appear, so we could just steal 9 of the unassigned slots to mean "one
of those 9 classes, plus set the high bit on the character value".
But given that the difference in table size versus an unpacked 32-bit
table is only 461 bytes, and that the unpacking code probably costs
nearly that much in .text size, it's probably better to just forget
that and use the 4-byte version at 1844 bytes. This makes the total
size 3441 + 1844 = 5285 bytes, which is still incredibly small, and
it also avoids implementing a packing we might have to rip out later
if Unicode expands character assignment in ways that don't fit.