mailing list of musl libc
* [musl] Collation, IDN, and Unicode normalization
@ 2025-05-05 18:18 Rich Felker
  2025-05-28  6:47 ` Rich Felker
  0 siblings, 1 reply; 11+ messages in thread
From: Rich Felker @ 2025-05-05 18:18 UTC (permalink / raw)
  To: musl

One aspect of LC_COLLATE support is that collation is supposed to be
invariant under canonical equivalence/different normalization forms,
while collation rules are best expressed in terms of NFD.

The most direct simple way to apply collation rules is to transform
into NFD (on the fly) as they're applied. A more time-efficient and
code-simplifying way is to apply a "canonical closure" (defined in
UTN#5) to the collation rules ahead of time. The cost is making the
collation tables larger (how much larger is something I still need to
quantify), but without using this approach, there is a table size cost
(as well as code and design for making this table efficient) to be
able to compute decompositions on the fly.

Separately (and not part of the locale overhaul project), IDN support
requires the capability to perform normalization into NFKC -- maybe
not for all of Unicode, but at least for the characters that could
appear in domain names. So in theory there is possibly some value to
trying to share the [de]composition tables and use them in both
directions.

I know for a very old version of Unicode supported in my uuterm,
decomposition tables and code fit in under 8k.

I'm guessing the canonical closure for the collation data will be
a lot larger than that, even if Hangul could be special-cased and
elided. But depending on what level of collation capability we want
internal to libc, independent of having a locale definition loaded
(which would be fully-shared mmapped), this size might mainly be in
locale files on disk, as opposed to decomposition tables which would
be linked into libc.

I'll be trying to work out some quantitative data on the tradeoffs
here, but wanted to go ahead and put the topic out there, especially
since the IDN topic has come up on IRC again recently and coming up
with a good choice here might intersect with IDN stuff.

Rich

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Collation, IDN, and Unicode normalization
  2025-05-05 18:18 [musl] Collation, IDN, and Unicode normalization Rich Felker
@ 2025-05-28  6:47 ` Rich Felker
  2025-05-29  0:29   ` [musl] " Rich Felker
  0 siblings, 1 reply; 11+ messages in thread
From: Rich Felker @ 2025-05-28  6:47 UTC (permalink / raw)
  To: musl

On Mon, May 05, 2025 at 02:18:49PM -0400, Rich Felker wrote:
> One aspect of LC_COLLATE support is that collation is supposed to be
> invariant under canonical equivalence/different normalization forms,
> while collation rules are best expressed in terms of NFD.
> 
> The most direct simple way to apply collation rules is to transform
> into NFD (on the fly) as they're applied. A more time-efficient and
> code-simplifying way is to apply a "canonical closure" (defined in
> UTN#5) to the collation rules ahead of time. The cost is making the
> collation tables larger (how much larger is something I still need to
> quantify), but without using this approach, there is a table size cost
> (as well as code and design for making this table efficient) to be
> able to compute decompositions on the fly.

It turns out a good deal of the decomposition/normalization logic is
needed even if you do the canonical closure approach, since you still
need to deal with non-starters (characters with canonical combining
class != 0) which may be reordered under normalization. For example,
given:

00C8 = 0045 0300 (E-grave)
1E1A = 0045 0330 (E-tilde_below)

00C8 0330 is equivalent to 1E1A 0300, and both must be treated as 0045
0330 0300 for the purpose of applying the collation rules (because
0330 has lower canonical combining class than 0300).
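To make the reordering concrete, here is a minimal C sketch of the decompose-and-reorder step, hardcoding only the data used in the example above (the `ccc` and `decomp` helpers are stand-ins for the real tables):

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in combining class data: only what this example needs. */
static unsigned ccc(unsigned c)
{
	switch (c) {
	case 0x300: return 230; /* COMBINING GRAVE ACCENT */
	case 0x330: return 220; /* COMBINING TILDE BELOW */
	default:    return 0;   /* starters */
	}
}

/* Stand-in canonical decompositions; returns length written to out. */
static size_t decomp(unsigned c, unsigned out[4])
{
	switch (c) {
	case 0x00C8: out[0]=0x45; out[1]=0x300; return 2; /* E-grave */
	case 0x1E1A: out[0]=0x45; out[1]=0x330; return 2; /* E-tilde-below */
	default:     out[0]=c; return 1;
	}
}

/* Decompose, then canonically order: bubble each new non-starter back
 * past any non-starter with a strictly higher combining class. */
static size_t to_nfd(const unsigned *in, size_t n, unsigned *out)
{
	size_t len = 0;
	for (size_t i = 0; i < n; i++) {
		unsigned tmp[4];
		size_t k = decomp(in[i], tmp);
		for (size_t j = 0; j < k; j++) {
			size_t p = len++;
			out[p] = tmp[j];
			while (p && ccc(out[p]) && ccc(out[p-1]) > ccc(out[p])) {
				unsigned t = out[p];
				out[p] = out[p-1]; out[p-1] = t;
				p--;
			}
		}
	}
	return len;
}
```

Both 00C8 0330 and 1E1A 0300 come out as 0045 0330 0300, so collation rules applied to the output can't distinguish the two inputs.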

One could imagine expanding out a closure for all possible sequences
that could reorder to a given collation element, but this quickly
blows up with all the non-meaningful-but-allowed sequences of
combining marks, and AFAICT the set isn't even finite. This means it
always requires some sort of fallback to an iterator that runs over
a sequence of a starter followed by a number of non-starters and reads
off the decomposed characters in order of canonical combining class,
without actually having working space to decode them. This will be an
important ingredient of the implementation, and it's a bit tricky to
do right, but tractable.
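One way such an iterator could work, assuming decomposition has already happened and we just need to read a starter-plus-non-starters run in canonical order: each call rescans the run for the smallest (class, position) pair greater than the last one emitted, so no reorder buffer is needed. All names and the `ccc` stub here are invented for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in combining class lookup for this sketch. */
static unsigned ccc(unsigned c)
{
	switch (c) {
	case 0x300: return 230;
	case 0x316: return 220;
	case 0x330: return 220;
	default:    return 0;
	}
}

struct canon_iter {
	const unsigned *seq; /* starter + run of non-starters */
	size_t len;
	unsigned last_ccc;   /* class of last character emitted */
	size_t last_pos;     /* its position; keeps equal classes stable */
	int started;
};

static void canon_iter_init(struct canon_iter *it, const unsigned *seq, size_t len)
{
	it->seq = seq; it->len = len;
	it->last_ccc = 0; it->last_pos = 0; it->started = 0;
}

/* Return the next character in canonical order, or 0 when the run is
 * exhausted. (Real code would signal end-of-run out of band.) */
static unsigned canon_iter_next(struct canon_iter *it)
{
	unsigned best = 0, best_ccc = 0;
	size_t best_pos = 0;
	int found = 0;
	for (size_t i = 0; i < it->len; i++) {
		unsigned cl = ccc(it->seq[i]);
		/* skip what's already out: at or before (last_ccc, last_pos) */
		if (it->started && (cl < it->last_ccc ||
		    (cl == it->last_ccc && i <= it->last_pos)))
			continue;
		if (!found || cl < best_ccc || (cl == best_ccc && i < best_pos)) {
			best = it->seq[i];
			best_ccc = cl;
			best_pos = i;
			found = 1;
		}
	}
	if (!found) return 0;
	it->started = 1;
	it->last_ccc = best_ccc;
	it->last_pos = best_pos;
	return best;
}
```

This is quadratic in the run length, but runs of non-starters are short in practice.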

So my conclusion is that the "canonical closure" approach is really a
performance optimization trading speed for size, and doesn't offer
much or anything in the way of simplicity. I think this means we can
and should forget about it.


> Separately (and not part of the locale overhaul project), IDN support
> requires the capability to perform normalization into NFKC -- maybe
> not for all of Unicode, but at least for the characters that could
> appear in domain names. So in theory there is possibly some value to
> trying to share the [de]composition tables and use them in both
> directions.

At least the above iterator can very much be shared here.

> I know for a very old version of Unicode supported in my uuterm,
> decomposition tables and code fit in under 8k.
> 
> I'm guessing the canonical closure for the collation data will be
> a lot larger than that, even if Hangul could be special-cased and
> elided. But depending on what level of collation capability we want
> internal to libc, independent of having a locale definition loaded
> (which would be fully-shared mmapped), this size might mainly be in
> locale files on disk, as opposed to decomposition tables which would
> be linked into libc.
> 
> I'll be trying to work out some quantitative data on the tradeoffs
> here, but wanted to go ahead and put the topic out there, especially
> since the IDN topic has come up on IRC again recently and coming up
> with a good choice here might intersect with IDN stuff.

The Default Unicode Collation Element Table (DUCET) contains nearly
40k explicit mappings, on top of implicit ones. Under 5000 of those
are mappings for precomposed characters. For the secondary and
tertiary weights, there are patterns that should collapse them down a
lot, maybe as small as a fraction of a bit per entry. But the primary
weights are generally not assigned systematically, at least not for
most of the real-world, living-language scripts in the BMP, making the
size at least 16 bits each for some 16k characters, and probably close
to that much for the remaining ~24k. That comes to likely 70-100k of
data, which is probably way above what we'd want to bake into libc.
(For example, a static-linked ls command would have it all due to
sorting by strcoll/strxfrm.)

So I think this points to having no baked-in default collation table
(e.g. for locales to delegate to, or for some variant of the C.UTF-8
locale to use) as the right thing to do. In a way this is good policy:
it means everything is up to the locale to specify the collation
order, incorporating the Unicode defaults (DUCET) or not, rather than
artificially favoring locales that build on it or requiring them to do
so.


So, next steps:

1. I know we need the above-described iterator, and it's something
concrete that can be written and tested independent of anything else.
So I'm going to begin working on it, and post a standalone
proof-of-concept once I have something working.

2. The output of that iterator can serve as the input to a prefix
tree, which would be the data structure for the collation table. This
needs some further checks that it works, but I think the
well-formedness conditions for collation element tables (UTS#10
section 5) ensure longest-matched-prefix is a way to select the rule
to apply.
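As a toy illustration of that selection, here is longest-matched-prefix over an invented three-rule table; a linear scan stands in for the prefix tree, which selects the same rule:

```c
#include <assert.h>
#include <stddef.h>

/* Invented mini rule table mapping codepoint sequences to primary
 * weights; real data would come from the locale's collation table. */
struct rule { const unsigned *seq; size_t len; unsigned weight; };

static const unsigned r1[] = {0x45};
static const unsigned r2[] = {0x45, 0x330};
static const unsigned r3[] = {0x45, 0x330, 0x300};
static const struct rule rules[] = {
	{ r1, 1, 100 }, { r2, 2, 200 }, { r3, 3, 300 },
};

/* Select the rule matching the longest prefix of in[0..n); stores the
 * number of characters consumed and returns the rule's weight. */
static unsigned longest_prefix(const unsigned *in, size_t n, size_t *consumed)
{
	unsigned w = 0;
	*consumed = 0;
	for (size_t r = 0; r < sizeof rules / sizeof rules[0]; r++) {
		const struct rule *p = &rules[r];
		if (p->len > n || p->len <= *consumed) continue;
		size_t i;
		for (i = 0; i < p->len && in[i] == p->seq[i]; i++);
		if (i == p->len) {
			w = p->weight;
			*consumed = p->len;
		}
	}
	return w;
}
```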




* Re: [musl] Collation, IDN, and Unicode normalization
  2025-05-28  6:47 ` Rich Felker
@ 2025-05-29  0:29   ` Rich Felker
  2025-05-29  2:37     ` Rich Felker
  0 siblings, 1 reply; 11+ messages in thread
From: Rich Felker @ 2025-05-29  0:29 UTC (permalink / raw)
  To: musl

On Wed, May 28, 2025 at 02:47:43AM -0400, Rich Felker wrote:
> On Mon, May 05, 2025 at 02:18:49PM -0400, Rich Felker wrote:
> > One aspect of LC_COLLATE support is that collation is supposed to be
> > invariant under canonical equivalence/different normalization forms,
> > while collation rules are best expressed in terms of NFD.
> > 
> > The most direct simple way to apply collation rules is to transform
> > into NFD (on the fly) as they're applied. A more time-efficient and
> > code-simplifying way is to apply a "canonical closure" (defined in
> > UTN#5) to the collation rules ahead of time. The cost is making the
> > collation tables larger (how much larger is something I still need to
> > quantify), but without using this approach, there is a table size cost
> > (as well as code and design for making this table efficient) to be
> > able to compute decompositions on the fly.
> 
> It turns out a good deal of the decomposition/normalization logic is
> needed even if you do the canonical closure approach, since you still
> need to deal with non-starters (characters with canonical combining
> class != 0) which may be reordered under normalization. For example,
> given:
> 
> 00C8 = 0045 0300 (E-grave)
> 1E1A = 0045 0330 (E-tilde_below)
> 
> 00C8 0330 is equivalent to 1E1A 0300, and both must be treated as 0045
> 0330 0300 for the purpose of applying the collation rules (because
> 0330 has lower canonical combining class than 0300).
> 
> One could imagine expanding out a closure for all possible sequences
> that could reorder to a given collation element, but this quickly
> blows up with all the non-meaningful-but-allowed sequences of
> combining marks, and AFAICT the set isn't even finite. This means it
> always requires some sort of fallback to an iterator that runs over
> a sequence of a starter followed by a number of non-starters and reads
> off the decomposed characters in order of canonical combining class,
> without actually having working space to decode them. This will be an
> important ingredient of the implementation, and it's a bit tricky to
> do right, but tractable.

OK, so I've worked out the iterator system and it's not all that bad.
I have untested code, but it depends on a backend for the
decomposition & canonical combining class tables which I haven't done
yet.

In regards to that, I have a rather unconventional idea for combining
data tables that I'd like to put forward.

First, some numbers.

There are 1046 characters in current Unicode that decompose, most of
them just to a pair of characters, but a handful to 3 or 4. (This is
not counting Hangul Jamo, which decompose algorithmically and don't
need any tables.)

There are 934 characters with nonzero canonical combining class. These
are called "non-starters". Each has an 8-bit combining class code
associated with it; of the 255 possible values, only 55 actually
appear.

The (pardon the pun) canonical way to represent these would be to have
a multilevel table "is_decomposing" to identify characters that
decompose, and a sorted list of decompositions to binary-search when
one is hit. And to have a similar multilevel table for combining
classes, but to have it map to the class number similar to how the
case mapping table maps to a smallish set of case-mapping rules.
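For reference, the binary-search half of that conventional scheme might look like the sketch below, with a few invented entries (offsets index a flat expansion array):

```c
#include <assert.h>
#include <stddef.h>

/* Sorted list of decomposing codepoints, each with an offset into a
 * flat expansion array; entries are invented examples. */
struct decomp_ent { unsigned cp; unsigned short off, len; };

static const struct decomp_ent decomp_list[] = {
	{ 0x00C0, 0, 2 }, { 0x00C8, 2, 2 }, { 0x1E1A, 4, 2 },
};
static const unsigned expansions[] = {
	0x41, 0x300,  /* 00C0 = A-grave */
	0x45, 0x300,  /* 00C8 = E-grave */
	0x45, 0x330,  /* 1E1A = E-tilde-below */
};

/* Binary search; returns the expansion or 0 if cp doesn't decompose. */
static const unsigned *find_decomp(unsigned cp, size_t *len)
{
	size_t lo = 0, hi = sizeof decomp_list / sizeof decomp_list[0];
	while (lo < hi) {
		size_t mid = lo + (hi - lo)/2;
		if (decomp_list[mid].cp == cp) {
			*len = decomp_list[mid].len;
			return expansions + decomp_list[mid].off;
		}
		if (decomp_list[mid].cp < cp) lo = mid+1;
		else hi = mid;
	}
	return 0;
}
```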

However, while decomposable characters and non-starters are separate
things, for practical purposes, they go hand-in-hand. They are the
characters that need special treatment when normalizing, and you have
to process both without knowing ahead of time which you're looking at.

So, what I think may make sense is combining them into a single table
that maps "character needing normalization processing" to a sequence
(possibly length-1, if this is a non-starter rather than an actual
decomposable character) of *pairs* of the form [character, canonical
combining class]. That way, you get out the canonical combining
classes you need at the same time as doing the decomposition, rather
than needing a separate step to look it up for each character that
comes out of the decomposition.

The cost of storing the extra bits for the combining class is
virtually nothing, since (1) it saves having a separate parallel data
structure, and (2) Unicode characters are 21-bit (17-bit in practice
for the relevant range), leaving well over 8 bits that are just wasted
unless you do costly-to-unpack packings.

Basically, the motivation is this: the bits are there unused, so we
might as well use them to store the data we're going to need right
afterwards anyway, and in that case, it makes sense to use the same
representation for non-starters.
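In other words, something along these lines for the slot format (a sketch; the exact field widths could differ):

```c
#include <assert.h>
#include <stdint.h>

/* One 32-bit slot per output character: 21 bits of codepoint leave 11
 * spare, so the 8-bit canonical combining class rides along for free. */
#define SLOT(cp, ccc) ((uint32_t)(cp) | (uint32_t)(ccc) << 21)
#define SLOT_CP(s)    ((s) & 0x1fffff)
#define SLOT_CCC(s)   ((s) >> 21 & 0xff)
```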

I'm not sure what the ideal format for the actual table is. I was
initially looking at a multi-level bit table to evaluate the predicate
"needs normalization processing" (same as the isw*() tables in musl
now), then binary-searching a sorted list. However, a direct
multi-level table mapping to the expansions might be an option as
well. There are currently 78 256-character-aligned blocks that have
relevant characters in them. This puts a bit table at about 2.5k, and
a full map at about 20k. Vs something like 7-10k (depending on how
nasty the encoding is) for the binary-searchable list of 1980 slots
each holding a 17-bit character number and an offset to the expansion.



* Re: Collation, IDN, and Unicode normalization
  2025-05-29  0:29   ` [musl] " Rich Felker
@ 2025-05-29  2:37     ` Rich Felker
  2025-05-29  9:45       ` [musl] " Nick Wellnhofer
  0 siblings, 1 reply; 11+ messages in thread
From: Rich Felker @ 2025-05-29  2:37 UTC (permalink / raw)
  To: musl

On Wed, May 28, 2025 at 08:29:53PM -0400, Rich Felker wrote:
> So, what I think may make sense is combining them into a single table,
> that maps "character needing normalization processing" to a sequence
> (possibly length-1, if this is a non-starter not an actual
> decomposable character) of *pairs* of the form [character, canonical
> combining class]. That way, you get out the canonical combining
> classes you need at the same time as doing the decomposition, rather
> than needing a separate step to look it up for each character that
> comes out of the decomposition.
> 
> The cost of storing the extra bits for the combining class is
> virtually nothing, since (1) it saves having a separate parallel data
> structure, and (2) Unicode characters are 21-bit (17-bit in practice
> for the relevant range), leaving well over 8 bits that are just wasted
> unless you do costly-to-unpack packings.
> 
> Basically, the motivation is this: the bits are there unused, so we
> might as well use them to store the data we're going to need right
> afterwards anyway, and in that case, it makes sense to use the same
> representation for non-starters.
> 
> I'm not sure what the ideal format for the actual table is. I was
> initially looking at a multi-level bit table to evaluate the predicate
> "needs normalization processing" (same as the isw*() tables in musl
> now), then binary-searching a sorted list. However, a direct
> multi-level table mapping to the expansions might be an option as
> well. There are currently 78 256-character-aligned blocks that have
> relevant characters in them. This puts a bit table at about 2.5k, and
> a full map at about 20k. Vs something like 7-10k (depending on how
> nasty the encoding is) for the binary-searchable list of 1980 slots
> each holding a 17-bit character number and an offset to the expansion.

Some more numbers:

For just the real decompositions, not including non-starters, there
are only 31 blocks (~8k of multi-level table) rather than 78. That
makes it sound appealing not to combine them. However, that leaves 68
blocks with non-starters in them which also have to be processed in
some other way. I'm pretty sure there are fewer than 16 classes per
block, so we could use a 4-bit table that would get it down to about
9k. But that's still nearly as large as the combined table (like 17k
vs 20k).

Alternatively, we could use a binary search. At 934 entries, that's 10
steps, which is rather costly. One nice thing is that we can easily
pre-filter for which characters might be non-starters using wcwidth
plus a few exceptions, but it still leaves processing text with any of
these combining marks inordinately expensive.

(Aside: All of the exceptions have general category Mc, some of them
changed from Mn in previous versions of Unicode, and it's probably an
error that our current policy for wcwidth is not classifying them as
nonspacing. This needs to be reviewed at some point.)

One hybrid option that might sound like a reasonable compromise is
using the block number (codepoint>>8) to index a list of binary search
endpoints, so that the list to search is small. However the common
combining characters are all grouped together in blocks 03, 1D, 2D,
and a few others, so performance for these would stay pretty bad
(roughly 6 step search).

Overall, none of these options are sounding like a significant win
over the fast option. OK, so let's see how much space that really
takes:


Top-level table (indexed by codepoint>>8) to select a table: 1 byte
per entry, for 512 bytes.

Second-level tables (indexed by codepoint&255): each one needs a few
header bytes to indicate where its data in the expansion table starts,
plus one byte per entry that's either the offset in the expansion
table or a 0 for "no expansion". Essentially this is something like
260 bytes per block, times 78 blocks, for 20280 bytes total.

Expansion table slots (indexed by second-level table): each slot is 4
bytes. 934 single-entry slots (non-starters) for 3736 bytes. 1046
decomposition entries, each 2-4 slots. 36 are 4-slot, 231 are 3-slot,
and 779 are 2-slot. That's 2395 slots, for 9580 bytes.

Total across all of these is 30372 bytes.
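Putting the layout above into code, the lookup might look like the sketch below. For brevity it uses an array of pointers for the top level rather than packed byte indices, and populates a single invented entry (00C8); slot 0 is reserved to mean "no expansion":

```c
#include <assert.h>
#include <stdint.h>

/* Expansion slots: [codepoint | ccc<<21] pairs; slot 0 means "none".
 * Only 00C8 -> 0045 0300 is populated in this sketch. */
static const uint32_t slots[] = {
	0,
	0x45,                       /* 0045, ccc 0 */
	(uint32_t)230<<21 | 0x300,  /* 0300, ccc 230 */
};

/* Second-level table for block 00: slot of first pair, or 0. */
static const uint8_t block00[256] = { [0xC8] = 1 };

/* Top level: one entry per 256-codepoint block of the 17-bit range. */
static const uint8_t *const top[512] = { [0x00] = block00 };

static const uint32_t *lookup(unsigned cp)
{
	const uint8_t *t = top[cp>>8];
	if (!t || !t[cp&255]) return 0;
	return slots + t[cp&255];
}
```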


One significant optimization I think I could make, without a cost to
performance or complexity, is using the second-level table header
bytes to encode the subrange of the block that has relevant codepoints
in it, so that the rest can be elided. I don't have the numbers yet
but I think that would cut off at least 50% of the second-level table
size, maybe more. There are some blocks with only a handful of
(usually consecutive or close) codepoints in them, and almost none
using more than half the range.

This does make it so that second-level tables no longer start at
multiple-of-table-size offsets (since they're variable size), so the
top-level table needs to be increased to 16-bit so it can store
offsets rather than indices. But that's really no big deal. It becomes
1k instead of 512 bytes.
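A sketch of that range-elided layout, with invented data (one block whose relevant codepoints span 0xC0-0xC8): each second-level table begins with the low and high bounds of its used subrange, and the 16-bit top level stores offsets into one flat byte pool, with 0 meaning "no table" and stored offsets biased by 1:

```c
#include <assert.h>
#include <stdint.h>

/* Flat pool of second-level tables: [lo, hi, offsets for lo..hi]. */
static const uint8_t pool[] = {
	0xC0, 0xC8,  1, 0, 0, 0, 0, 0, 0, 0, 2,
};

/* 16-bit top level: 0 = no table, else offset into pool plus 1. */
static const uint16_t top[512] = { [0x00] = 1 };

/* Return the expansion-table slot for cp, or 0 for none. */
static unsigned lookup(unsigned cp)
{
	uint16_t t = top[cp>>8];
	if (!t) return 0;
	const uint8_t *p = pool + t - 1;
	unsigned low = cp & 255;
	if (low < p[0] || low > p[1]) return 0;
	return p[2 + low - p[0]];
}
```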


With this change, hopefully the data can be under 20k in total.

To be clear, this is size that goes in libc.so/libc.a, but doesn't
appear in static-linked programs unless they call strcoll or strxfrm.
Eventually it may also appear for IDN in static-linked programs that
use the stub resolver.

I'll try to work out a draft of the table gen code in the next few
days.




* Re: [musl] Collation, IDN, and Unicode normalization
  2025-05-29  2:37     ` Rich Felker
@ 2025-05-29  9:45       ` Nick Wellnhofer
  2025-05-29 13:30         ` Rich Felker
  0 siblings, 1 reply; 11+ messages in thread
From: Nick Wellnhofer @ 2025-05-29  9:45 UTC (permalink / raw)
  To: musl

On May 29, 2025, at 04:37, Rich Felker <dalias@libc.org> wrote:
> Top-level table (indexed by codepoint>>8) to select a table: 1 byte
> per entry, for 512 bytes.
> 
> Second-level tables (indexed by codepoint&255):

You could also try different bit shifts that might yield smaller tables. Another option to compress the data further is to use third-level tables. I have some old code somewhere that brute-forces all combinations of shift values to find the smallest tables.
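For what it's worth, such a brute-force search can be sketched as below; the cost model (1 byte per top-level entry, a small header plus 1 byte per codepoint in each used block) is a simplification of the structure under discussion, and the names are invented:

```c
#include <assert.h>
#include <stddef.h>

/* Estimated two-level table size for a given block shift, over a
 * sorted list of relevant codepoints. */
static size_t table_size(const unsigned *cps, size_t n,
                         unsigned shift, unsigned maxcp)
{
	size_t blocks = 0;
	unsigned prev = ~0u;
	for (size_t i = 0; i < n; i++) {
		if (cps[i]>>shift != prev) {
			prev = cps[i]>>shift;
			blocks++;
		}
	}
	return ((size_t)(maxcp>>shift) + 1)  /* top level */
	     + blocks * ((1u<<shift) + 2);   /* used blocks + headers */
}

/* Try every shift in a plausible range; return the cheapest. */
static unsigned best_shift(const unsigned *cps, size_t n, unsigned maxcp)
{
	unsigned best = 4;
	size_t best_sz = (size_t)-1;
	for (unsigned s = 4; s <= 12; s++) {
		size_t sz = table_size(cps, n, s, maxcp);
		if (sz < best_sz) {
			best_sz = sz;
			best = s;
		}
	}
	return best;
}
```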

Nick



* Re: Collation, IDN, and Unicode normalization
  2025-05-29  9:45       ` [musl] " Nick Wellnhofer
@ 2025-05-29 13:30         ` Rich Felker
  2025-05-29 14:03           ` Rich Felker
  0 siblings, 1 reply; 11+ messages in thread
From: Rich Felker @ 2025-05-29 13:30 UTC (permalink / raw)
  To: Nick Wellnhofer; +Cc: musl

On Thu, May 29, 2025 at 11:45:01AM +0200, Nick Wellnhofer wrote:
> On May 29, 2025, at 04:37, Rich Felker <dalias@libc.org> wrote:
> > Top-level table (indexed by codepoint>>8) to select a table: 1 byte
> > per entry, for 512 bytes.
> > 
> > Second-level tables (indexed by codepoint&255):
> 
> You could also try different bit shifts that might yield smaller
> tables. Another option to compress the data further is to use
> third-level tables. I have some old code somewhere that brute-forces
> all combinations of shift values to find the smallest tables.

Indeed, that's certainly a reasonable option to consider. My original
balancing between the top- and second-level sizes was based on the
1-bit tables for isw*(), where the top-level has much higher weight
and you want to keep it small. But here, where the second-level has
more weight, it may make sense to use much smaller blocks. I'll have a
look at the distribution and see if this looks like it would be better
or worse than the cheap range-limiting shortcut I mentioned before.

Thanks for reading and offering up ideas!

Rich



* Re: Collation, IDN, and Unicode normalization
  2025-05-29 13:30         ` Rich Felker
@ 2025-05-29 14:03           ` Rich Felker
  2025-05-29 14:15             ` Thorsten Glaser
  2025-05-29 16:52             ` [musl] " Rich Felker
  0 siblings, 2 replies; 11+ messages in thread
From: Rich Felker @ 2025-05-29 14:03 UTC (permalink / raw)
  To: Nick Wellnhofer; +Cc: musl

On Thu, May 29, 2025 at 09:30:13AM -0400, Rich Felker wrote:
> On Thu, May 29, 2025 at 11:45:01AM +0200, Nick Wellnhofer wrote:
> > On May 29, 2025, at 04:37, Rich Felker <dalias@libc.org> wrote:
> > > Top-level table (indexed by codepoint>>8) to select a table: 1 byte
> > > per entry, for 512 bytes.
> > > 
> > > Second-level tables (indexed by codepoint&255):
> > 
> > You could also try different bit shifts that might yield smaller
> > tables. Another option to compress the data further is to use
> > third-level tables. I have some old code somewhere that brute-forces
> > all combinations of shift values to find the smallest tables.
> 
> Indeed, that's certainly a reasonable option to consider. My original
> balancing between the top- and second-level sizes was based on the
> 1-bit tables for isw*(), where the top-level has much higher weight
> and you want to keep it small. But here, where the second-level has
> more weight, it may make sense to use much smaller blocks. I'll have a
> look at the distribution and see if this looks like it would be better
> or worse than the cheap range-limiting shortcut I mentioned before.
> 
> Thanks for reading and offering up ideas!

Via some simple grep, sed, cut, and uniq:

grep -E '^([^;]*;){3}[^0].*' UnicodeData.txt | cut -d';' -f1 | sed -e 's/[8-9A-F].$/80/' -e 's/[0-7].$/00/' | uniq -c

there are only 101 128-codepoint blocks vs 68 256-codepoint blocks.

This is just looking at the non-starters, since the decomposing
characters are much less of a burden already.

Extending that further, there are only 135 64-codepoint blocks. But 84
of those just have 1-3 codepoints in them, the rest still being dead
space. This suggests to me that the range check and elision of all but
the used range of the block is going to have *much* bigger returns.

Still, it's possible to combine both and that might turn out to be
really effective vs just doing the range checks. In fact that looks
like it might be the case. Looking at my output up through 0FFF, just
the 64-codepoint blocks with 1-3 entries:

3 0840
1 0900
1 0980
2 09C0
1 0A00
1 0A40
1 0A80
1 0AC0
1 0B00
1 0B40
1 0BC0
1 0C00
3 0C40
1 0C80
1 0CC0
2 0D00
1 0D40
1 0DC0
3 0E00
3 0E80
1 0FC0

we see that almost all of the 256-codepoint blocks have sporadic
entries that would require a range of at least 64 (and often 128-192)
to be mapped, but become range-1 blocks if split down on 64-codepoint
granularity.

The degree to which these are sporadic suggests an even more optimal
solution, which I don't think I want to pursue, but it should be
mentioned: with a 1bit table to exclude codepoints that are not
subject to normalization processing, I think one could compute a
perfect hash function to map all of them to a flat array with
relatively few unused slots. This would require the perfect hash
building logic to go into musl-chartable-tools. In this solution, the
expected total table size would be roughly 14-15k. That's not really a
whole lot smaller than what I expect we can get the multi-level table
down to. If you take out the actual expansions, which are 9580 bytes,
it's a better comparison. Those could be packed tighter to get it down
further, but improving them is independent of the way we map to them.

OK let's estimate the 64-codepoint block multilevel table size. There
are 51 nontrivial (more than 1-3) blocks so let's estimate them all at
full 66-byte size, and the rest (84) at 10 bytes each. That's 84*10 +
51*66 = 4206 bytes. Plus the 2048-byte top-level table, that's 6254
bytes, in the same ballpark as the perfect hash with 1-bit rejection
table. (Note: the 1-bit rejection table can't really be optimized much
by excluding ranges, because we're down to the level of reasonable
storage granularity.) So I think this is looking really good, and
about half of the 30k we were looking at before.

Rich



* Re: Collation, IDN, and Unicode normalization
  2025-05-29 14:03           ` Rich Felker
@ 2025-05-29 14:15             ` Thorsten Glaser
  2025-05-29 15:59               ` Rich Felker
  2025-05-29 16:52             ` [musl] " Rich Felker
  1 sibling, 1 reply; 11+ messages in thread
From: Thorsten Glaser @ 2025-05-29 14:15 UTC (permalink / raw)
  To: musl

Rich Felker dixit:

>Extending that further, there are only 135 64-codepoint blocks. But 84
>of those just have 1-3 codepoints in them, the rest still being dead

Is it an option to do blocks on the higher level and once you reach
the block of a certain size (64? 128? 256?) do a linear search, if
most really have that few codepoints?

Hm probably would suck for the 0300 block though.

Just had that crazy idea, not sure if it’s viable.

bye,
//mirabilos
-- 
15:41⎜<Lo-lan-do:#fusionforge> Somebody write a testsuite for helloworld :-)



* Re: Collation, IDN, and Unicode normalization
  2025-05-29 14:15             ` Thorsten Glaser
@ 2025-05-29 15:59               ` Rich Felker
  0 siblings, 0 replies; 11+ messages in thread
From: Rich Felker @ 2025-05-29 15:59 UTC (permalink / raw)
  To: Thorsten Glaser; +Cc: musl

On Thu, May 29, 2025 at 02:15:59PM +0000, Thorsten Glaser wrote:
> Rich Felker dixit:
> 
> >Extending that further, there are only 135 64-codepoint blocks. But 84
> >of those just have 1-3 codepoints in them, the rest still being dead
> 
> Is it an option to do blocks on the higher level and once you reach
> the block of a certain size (64? 128? 256?) do a linear search, if
> most really have that few codepoints?
> 
> Hm probably would suck for the 0300 block though.
> 
> Just had that crazy idea, not sure if it’s viable.

Yeah that's the problem with linear (or rather binary, no reason to do
linear) search options: the places you take a performance hit are
exactly the ones where all the common characters lie.

Rich



* Re: [musl] Collation, IDN, and Unicode normalization
  2025-05-29 14:03           ` Rich Felker
  2025-05-29 14:15             ` Thorsten Glaser
@ 2025-05-29 16:52             ` Rich Felker
  2025-05-31 18:49               ` Rich Felker
  1 sibling, 1 reply; 11+ messages in thread
From: Rich Felker @ 2025-05-29 16:52 UTC (permalink / raw)
  To: Nick Wellnhofer; +Cc: musl

On Thu, May 29, 2025 at 10:03:57AM -0400, Rich Felker wrote:
> down to. If you take out the actual expansions, which are 9580 bytes,
> it's a better comparison. Those could be packed tighter to get it down
> further, but improving them is independent of the way we map to them.

OK, those can be cut down a lot.

The 9580 figure was from 2395 slots each 32-bit, each holding a 17-bit
codepoint and 8-bit combining class.

However, a quick analysis shows there are only 94 characters that ever
appear in a non-initial position of an expansion. This means the
actual expansions can be a series of a potentially larger initial
member followed by 7 bits (plus 1 continuation-flag bit?) for each
subsequent character in the expansion.

That requires a 94-entry table to map the possibilities to actual
character numbers and combining classes, but the table can easily be
24-bit packed rather than 32-bit if we want, as 282 bytes, or just
aligned 32-bit as 376 bytes.

There are 317 possible characters that appear in the first slot of an
expansion, and some of these overlap, because there are only 406 total
that appear in any position.

We could put all of those 406 in a table, with the 94 at the beginning
so they can be referenced by a single byte. 

For non-starters which expand to themselves, we can have an additional
55 slots, one for each possible combining class, with a dummy
character value meaning "expands to itself" and only the combining
class value being used. That increases the number of entries to 461.

That's 1383 or 1844 bytes depending on whether we pack to 3-byte
entries.

Now, what about the reference for the initial character of each
expansion? Do we make them 2-byte? How about instead, making all of
the references a variable-length code. A byte x<128 represents itself;
a byte x>=128 represents (x-128)*128+y where y is the next byte. This
future-proofs things, allowing the number of characters that could
appear to be as high as 16384, and lets us use a single byte for a lot
of the leading characters of expansions too.

Or, since we're not going to need anywhere near that many, we could
reserve only a few bits for the variable-length coding. For example,
if x<240 represents itself, all of the 94 non-initial decomposition
characters and all of the 55 non-starter class codes fit in a single
byte, and we can still represent everything else in the range of
(x-240)*240+y which reaches up to 3839 (over 8x the presently needed
range).
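A sketch of that x<240 code, with an encoder for symmetry (the 240 cutoff is per the scheme above; encodable values must stay at or below 3839):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Values below 240 are one byte; otherwise two bytes decoding as
 * (x-240)*240 + y, which tops out at (255-240)*240 + 239 = 3839. */
static size_t vlc_encode(unsigned v, uint8_t out[2])
{
	if (v < 240) {
		out[0] = v;
		return 1;
	}
	out[0] = 240 + v/240;  /* requires v <= 3839 */
	out[1] = v % 240;
	return 2;
}

static size_t vlc_decode(const uint8_t *in, unsigned *v)
{
	if (in[0] < 240) {
		*v = in[0];
		return 1;
	}
	*v = (in[0]-240)*240 + in[1];
	return 2;
}
```

All of the 94 non-initial decomposition characters and 55 class codes then take one byte, while everything else still fits in two.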

Now, let's estimate sizes again:

If there are 2395 slots, 1046 of them initial, let's assume the
initial ones take 2 bytes each (some can be less) in the expansion
table and the rest take 1 byte each. That's 1349 + 2*1046 = 3441 bytes
for expansions. Plus the above 1383 bytes for the table of possible
characters, that gives 4824 bytes. Basically half of what we started at.

One final note: I've kinda handwaved being able to pack the table of
possible characters into 3 bytes each rather than 4, since they're
17-bit characters with 8-bit combining classes. However, in the
non-BMP characters, only 9 of the possible 55 classes ever appear, so
we can just steal 9 of the unassigned slots to use as "this means one
of those 9, plus set the high bit on the character value". However,
given that the difference in table size vs using a 32-bit unpacked
table is only 461 bytes, and that the unpacking code probably costs
nearly that much in .text size, it's probably better to just forget
that and use the 4-byte version at 1844 bytes. This makes the total
size 5285 bytes, which is still incredibly small, and it also avoids
implementing a packing we might have to rip out later if Unicode
expands character assignment in ways that don't fit.



* Re: Collation, IDN, and Unicode normalization
  2025-05-29 16:52             ` [musl] " Rich Felker
@ 2025-05-31 18:49               ` Rich Felker
  0 siblings, 0 replies; 11+ messages in thread
From: Rich Felker @ 2025-05-31 18:49 UTC (permalink / raw)
  To: Nick Wellnhofer; +Cc: musl

On Thu, May 29, 2025 at 12:52:49PM -0400, Rich Felker wrote:
> On Thu, May 29, 2025 at 10:03:57AM -0400, Rich Felker wrote:
> > down to. If you take out the actual expansions, which are 9580 bytes,
> > it's a better comparison. Those could be packed tighter to get it down
> > further, but improving them is independent of the way we map to them.
> 
> OK, those can be cut down a lot.
> 
> The 9580 figure was from 2395 slots each 32-bit, each holding a 17-bit
> codepoint and 8-bit combining class.
> 
> However, a quick analysis shows there are only 94 characters that ever
> appear as the non-initial position of an expansion. This means the
> actual expansions can be a series of a potentially larger initial
> member followed by 7 bits (plus 1 continuation-flag bit?) for each
> subsequent character in the expansion.
> 
> That requires a 94-entry table to map the possibilities to actual
> character numbers and combining classes, but the table can easily be
> 24-bit packed rather than 32-bit if we want, as 282 bytes, or just
> aligned 32-bit as 376 bytes.
> 
> There are 317 possible characters that appear in the first slot of an
> expansion, and some of these overlap, because there are only 406 total
> that appear in any position.
> 
> We could put all of those 406 in a table, with the 94 at the beginning
> so they can be referenced by a single byte. 
> 
> For non-starters which expand to themselves, we can have an additional
> 55 slots, one for each possible combining class, with a dummy
> character value meaning "expands to itself" and only the combining
> class value being used. That increases the number of entries to 461.
> 
> That's 1383 or 1844 bytes depending on whether we pack to 3-byte
> entries.
> 
> Now, what about the reference for the initial character of each
> expansion? Do we make them 2-byte? How about instead, making all of
> the references a variable-length code. A byte x<128 represents itself;
> a byte x>=128 represents (x-128)*128+y where y is the next byte. This
> future-proofs things by allowing the number of characters that could
> appear to be as high as 16384, and lets us use a single byte for a lot
> of the leading characters of expansions too.
> 
> Or, since we're not going to need anywhere near that many, we could
> reserve only a few bits for the variable-length coding. For example,
> if x<240 represents itself, all of the 94 non-initial decomposition
> characters and all of the 55 non-starter class codes fit in a single
> byte, and we can still represent everything else in the range of
> (x-240)*240+y which reaches up to 3839 (over 8x the presently needed
> range).
> 
> Now, let's estimate sizes again:
> 
> If there are 2395 slots, 1046 of them initial, let's assume the
> initial ones take 2 bytes each (some can be less) in the expansion
> table and the rest take 1 byte each. That's 1349 + 2*1046 = 3441 bytes
> for expansions. Plus the above 1383 bytes for the table of possible
> characters, gives 4824 bytes. Basically half of what we started at.
> 
> One final note: I've kinda handwaved being able to pack the table of
> possible characters into 3 bytes each rather than 4, since they're
> 17-bit characters with 8-bit combining classes. However, in the
> non-BMP characters, only 9 of the possible 55 classes ever appear, so
> we can just steal 9 of the unassigned slots to use as "this means one
> of those 9, plus set the high bit on the character value". However,
> given that the difference in table size vs using a 32-bit unpacked
> table is only 461 bytes, and that the unpacking code probably costs
> nearly that much in .text size, it's probably better to just forget
> that and use the 4-byte version at 1844 bytes. This makes the total
> size 5285 bytes, which is still incredibly small, and it also avoids
> implementing a packing we might have to rip out later if Unicode
> expands character assignment in ways that don't fit.

Small setback that, without other changes, makes throwing away that
packing rather costly:

While there are only a small number of characters that appear in
multi-character decompositions, I failed to include single-character
decompositions ("singletons" in Unicode normalization terminology).
This brings the number (together with nonstarter self-map entries)
that appear in decompositions up to 1375. Most of these are the
characters only present for round-tripping legacy character sets, such
as Å vs "ANGSTROM SIGN", Ω vs "OHM SIGN", and of course all the "CJK
COMPATIBILITY IDEOGRAPH" codepoints.

For what it's worth, there are 3 odd compat characters:

0340;COMBINING GRAVE TONE MARK;Mn;230;NSM;0300;;;;N;NON-SPACING GRAVE TONE MARK;;;;
0341;COMBINING ACUTE TONE MARK;Mn;230;NSM;0301;;;;N;NON-SPACING ACUTE TONE MARK;;;;
0343;COMBINING GREEK KORONIS;Mn;230;NSM;0313;;;;N;;;;;

that decompose respectively to nonstarters:

0300;COMBINING GRAVE ACCENT;Mn;230;NSM;;;;;N;NON-SPACING GRAVE;;;;
0301;COMBINING ACUTE ACCENT;Mn;230;NSM;;;;;N;NON-SPACING ACUTE;;;;
0313;COMBINING COMMA ABOVE;Mn;230;NSM;;;;;N;NON-SPACING COMMA ABOVE;;;;

All the rest of the singletons decompose to starters (don't need to
store a combining class). Since the above 3 seem to have basically
just been a historical mistake, and the bulk of new singletons are CJK
COMPAT, I think it makes sense to optimize around assuming most
future-added singletons will be starters (not need a combining class).

So..

It's really nasty to be spending 4 or even 3 bytes per entry as a
"character that appears in decompositions" for "character that appears
once in a single decomposition because it was accidentally encoded
twice in some obscure legacy character set". This is on top of needing
an entry in, and a 2-byte reference out of, the expansions table, not
to mention increasing the number and size of populated second-level
tables.

One possibility is to allow coding characters directly in the
expansion table rather than as references to a commonly appearing
character. The VLC so far is really only defined as:

    0..239: single byte index
    240..255: head of double byte index

But since that whole range isn't needed, it could be expanded to:

    0..239: single byte index
    240..251: head of double byte index
    252..255: top 2 bits of an immediate 2-byte character that follows

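
In code, the extended decode might look like this (again just a sketch;
the big-endian byte order for the 2-byte immediate is my arbitrary
choice, and the names are mine):

```c
#include <stddef.h>

/* A decoded reference: either an index into the table of characters
   that appear in decompositions, or an immediate codepoint (always a
   starter, so no combining class is needed). */
struct ref { int is_index; unsigned val; size_t len; };

static struct ref ref_decode(const unsigned char *p)
{
	struct ref r;
	if (p[0] < 240) {
		r.is_index = 1; r.val = p[0]; r.len = 1;
	} else if (p[0] < 252) {
		r.is_index = 1; r.val = (p[0] - 240) * 240 + p[1]; r.len = 2;
	} else {
		/* 252..255 carry the top 2 bits; the next two bytes carry
		   the low 16, so values reach 0x3FFFF (U+3FFFF). */
		r.is_index = 0;
		r.val = (unsigned)(p[0] - 252) << 16 | p[1] << 8 | p[2];
		r.len = 3;
	}
	return r;
}
```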
This would allow representing any character up to U+3FFFF (note: some
in the 2xxxx range appear as CJK COMPAT mappings), as long as it's a
starter, without needing an entry in the list of chars that appear in
decompositions, with only 3 bytes (vs 2) in the expansion table.
That's saving 3 (or 2, if we could pack the latter down to 3 total)
bytes per singleton. It might help reduce the size of some existing
non-singleton decompositions too.

Indeed, only 135 characters appear 3 or more times in decompositions.

Including singletons, 1054 only appear once.

Excluding singletons, still 231 appear only once. (These cost more to
put in a table than to inline in the expansions.)

If we only include in the table characters which either appear at
least 3 times, or which are non-starters, double-byte indices wouldn't
get used at all. Each expansion would either be a single-byte index or
an inline character value.

If we don't actually need double-byte indices, we could in theory
recover the values used for them to encode inline characters more
efficiently. But I don't think it really helps that much. The
locations where individual characters aren't common in decompositions,
but where characters with matching upper bits are common, are fairly
sporadic, spread over the 00, 04, 05, 0F, 22, 30 blocks and the entire
CJK ideographs range. In theory we could pick like the 16 most
commonly appearing blocks and encode those in the lead byte, but it
would save a few hundred bytes total at most, and it's gratuitously
ugly and dependent on either manually picking the best blocks or
marshalling results of the optimization logic into code. I think this
is a bad direction to pursue.




^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2025-05-31 18:49 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-05-05 18:18 [musl] Collation, IDN, and Unicode normalization Rich Felker
2025-05-28  6:47 ` Rich Felker
2025-05-29  0:29   ` [musl] " Rich Felker
2025-05-29  2:37     ` Rich Felker
2025-05-29  9:45       ` [musl] " Nick Wellnhofer
2025-05-29 13:30         ` Rich Felker
2025-05-29 14:03           ` Rich Felker
2025-05-29 14:15             ` Thorsten Glaser
2025-05-29 15:59               ` Rich Felker
2025-05-29 16:52             ` [musl] " Rich Felker
2025-05-31 18:49               ` Rich Felker

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).