mailing list of musl libc
 help / color / mirror / code / Atom feed
From: Rich Felker <dalias@libc.org>
To: Nick Wellnhofer <wellnhofer@aevum.de>
Cc: musl@lists.openwall.com
Subject: Re: [musl] Collation, IDN, and Unicode normalization
Date: Thu, 29 May 2025 12:52:49 -0400	[thread overview]
Message-ID: <20250529165249.GB1827@brightrain.aerifal.cx> (raw)
In-Reply-To: <20250529140357.GY1827@brightrain.aerifal.cx>

On Thu, May 29, 2025 at 10:03:57AM -0400, Rich Felker wrote:
> down to. If you take out the actual expansions, which are 9580 bytes,
> it's a better comparison. Those could be packed tighter to get it down
> further, but improving them is independent of the way we map to them.

OK, those can be cut down a lot.

The 9580 figure was from 2395 slots each 32-bit, each holding a 17-bit
codepoint and 8-bit combining class.

However, a quick analysis shows there are only 94 characters that ever
appear as the non-initial position of an expansion. This means the
actual expansions can be a series of a potentially larger initial
member followed by 7 bits (plus 1 continuation-flag bit?) for each
subsequent character in the expansion.

That requires a 94-entry table to map the possibilities to actual
character numbers and combining classes, but the table can easily be
24-bit packed rather than 32-bit if we want, as 282 bytes, or just
aligned 32-bit as 376 bytes.

There are 317 possible characters that appear in the first slot of an
expansion, and some of these overlap, because there are only 406 total
that appear in any position.

We could put all of those 406 in a table, with the 94 at the beginning
so they can be referenced by a single byte. 

For non-starters which expand to themselves, we can have an additional
55 slots, one for each possible combining class, with a dummy
character value meaning "expands to itself" and only the combining
class value being used. That increases the number of entries to 461.

That's 1383 or 1844 bytes depending on whether we pack to 3-byte
entries.

Now, what about the reference for the initial character of each
expansion? Do we make them 2-byte? How about instead, making all of
the references a variable-length code. A byte x<128 represents itself;
a byte x>=128 represents (x-128)*128+y where y is the next byte. This
future-proofs things allowing the number of characters that could
appear to be as high at 16384, and lets us use a single-byte for a lot
of the leading characters of expansions too.

Or, since we're not going to need anywhere near that many, we could
reserve only a few bits for the variable-length codind. For example,
if x<240 represents itself, all of the 94 non-initial decomposition
characters and all of the 55 non-starter class codes fit in a single
byte, and we can still represent everything else in the range of
(x-240)*240+y which reaches up to 3839 (over 8x the presently needed
range).

Now, let's estimate sizes again:

If there are 2395 slots, 1046 of them initial, let's assume the
initial ones take 2 bytes each (some can be less) in the expansion
table and the rest take 1 byte each. That's 1349 + 2*1046 = 3441 bytes
for expansions. Plus the above 1383 bytes for the table of possible
character, gives 4824 bytes. Basically half of what we started at.

One final note: I've kinda handwaved being able to pack the table of
possible characters into 3 bytes each rather than 4, since they're
17-bit characters with 8-bit combining classes. However, in the
non-BMP characters, only 9 of the possible 55 classes ever appear, so
we can just steal 9 of the unassigned slots to use as "this means one
of those 9, plus set the high bit on the character value". However,
given that the difference in table size vs using a 32-bit unpacked
table is only 461 bytes, and that the unpacking code probably costs
nearly that much in .text size, it's probably better to just forget
that and use the 4-byte version at 1844 bytes. This makes the total
size 5285 bytes, which is still incredibly small, and it also avoids
implementing a packing we might have to rip out later if Unicode
expands character assignment in ways that don't fit.


      parent reply	other threads:[~2025-05-29 16:53 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-05-05 18:18 Rich Felker
     [not found] ` <20250528064742.GS1827@brightrain.aerifal.cx>
2025-05-29  0:29   ` Rich Felker
     [not found]     ` <20250529023741.GW1827@brightrain.aerifal.cx>
2025-05-29  9:45       ` Nick Wellnhofer
     [not found]         ` <20250529133013.GX1827@brightrain.aerifal.cx>
     [not found]           ` <20250529140357.GY1827@brightrain.aerifal.cx>
2025-05-29 16:52             ` Rich Felker [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20250529165249.GB1827@brightrain.aerifal.cx \
    --to=dalias@libc.org \
    --cc=musl@lists.openwall.com \
    --cc=wellnhofer@aevum.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).