mailing list of musl libc
 help / color / mirror / code / Atom feed
From: Rich Felker <dalias@libc.org>
To: musl@lists.openwall.com
Subject: Re: [musl] Collation, IDN, and Unicode normalization
Date: Wed, 28 May 2025 20:29:53 -0400	[thread overview]
Message-ID: <20250529002951.GU1827@brightrain.aerifal.cx> (raw)
In-Reply-To: <20250528064742.GS1827@brightrain.aerifal.cx>

On Wed, May 28, 2025 at 02:47:43AM -0400, Rich Felker wrote:
> On Mon, May 05, 2025 at 02:18:49PM -0400, Rich Felker wrote:
> > One aspect of LC_COLLATE support is that collation is supposed to be
> > invariant under canonical equivalence/different normalization forms,
> > while collation rules are best expressed in terms of NFD.
> > 
> > The most direct simple way to apply collation rules is to transform
> > into NFD (on the fly) as they're applied. A more time-efficient and
> > code-simplifying way is to apply a "canonical closure" (defined in
> > UTN#5) to the collation rules ahead of time. The cost is making the
> > collation tables larger (how much larger is something I still need to
> > quantify), but without using this approach, there is a table size cost
> > (as well as code and design for making this table efficient) to be
> > able to compute decompositions on the fly.
> 
> It turns out a good deal of the decomposition/normalization logic is
> needed even if you do the canonical closure approach, since you still
> need to deal with non-starters (characters with canonical combining
> class != 0) which may be reordered under normalization. For example,
> given:
> 
> 00C8 = 0045 0300 (E-grave)
> 1E1A = 0045 0330 (E-tilde_below)
> 
> 00C8 0330 is equivalent to 1E1A 0300, and both must be treated as 0045
> 0330 0300 for the purpose of applying the collation rules (because
> 0330 has lower canonical combining class than 0300).
> 
> One could imagine expanding out a closure for all possible sequences
> that could reorder to a given collation element, but this quickly
> blows up with all the non-meaningful-but-allowed sequences of
> combining marks, and AFAICT the set isn't even finite. This means it
> always requires some sort of fallback to an iterator that runs over
> a sequence of a starter followed by a number of non-starters and reads
> off the decomposed characters in order of canonical combining class,
> without actually having working space to decode them. This will be an
> important ingredient of the implementation, and it's a bit tricky to
> do right, but tractable.

OK, so I've worked out the iterator system and it's not all that bad.
I have untested code, but it depends on a backend for the
decomposition & canonical combining class tables which I haven't done
yet.

In regards to that, I have a rather unconventional idea for combining
data tables that I'd like to put forward.

First, some numbers.

There are 1046 characters in current Unicode that decompose, most of
them just to a pair of characters, but a handful to 3 or 4. (This is
not counting Hangul Jamo, which decompose algorithmically and don't
need any tables.)

There are 934 characters with nonzero canonical combining class. These
are called "non-starters". Each has an 8-bit combining class code
associated with it. There are 55 of the possible 255 values that
actually appear.

The (pardon the pun) canonical way to represent these would be to have
a multilevel table "is_decomposing" to identify characters that
decompose, and a sorted list of decompositions to binary-search when
one is hit. And to have a similar multilevel table for combining
classes, but to have it map to the class number similar to how the
case mapping table maps to a smallish set of case-mapping rules.

However, while decomposable characters and non-starters are separate
things, for practical purposes, they go hand-in-hand. They are the
characters that need special treatment when normalizing, and you have
to process both without knowing ahead of time which you're looking at.

So, what I think may make sense is combining them into a single table,
that maps "character needing normalization processing" to a sequence
(possibly length-1, if this is a non-starter not an actual
decomposable character) of *pairs* of the form [character, canonical
combining class]. That way, you get out the canonical combining
classes you need at the same time as doing the decomposition, rather
than needing a separate step to look it up for each character that
comes out of the decomposition.

The cost of storing the extra bits for the combining class is
virtually nothing, since (1) it saves having a separate parallel data
structure, and (2) Unicode characters are 21-bit (17-bit in practice
for the relevant range), leaving well over 8 bits that are just wasted
unless you do costly-to-unpack packings.

Basically, the motivation is this: the bits are there unused, so we
might as well use them to store the data we're doing to need right
afterwards anyway, and in that case, it makes sense to use the same
representation for non-starters.

I'm not sure what the ideal format for the actual table is. I was
initially looking at a multi-level bit table to evaluate the predicate
"needs normalization processing" (same as the isw*() tables in musl
now), then binary-searching a sorted list. However, a direct
multi-level table mapping to the expansions might be an option as
well. There are currently 78 256-character-aligned blocks that have
relevant characters in them. This puts a bit table at about 2.5k, and
a full map at about 20k. Vs something like 7-10k (depending on how
nasty the encoding is) for the binary-searchable list of 1980 slots
each holding a 17-bit character number and an offset to the expansion.


  parent reply	other threads:[~2025-05-29  0:30 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-05-05 18:18 Rich Felker
     [not found] ` <20250528064742.GS1827@brightrain.aerifal.cx>
2025-05-29  0:29   ` Rich Felker [this message]
     [not found]     ` <20250529023741.GW1827@brightrain.aerifal.cx>
2025-05-29  9:45       ` Nick Wellnhofer
     [not found]         ` <20250529133013.GX1827@brightrain.aerifal.cx>
     [not found]           ` <20250529140357.GY1827@brightrain.aerifal.cx>
2025-05-29 16:52             ` Rich Felker

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20250529002951.GU1827@brightrain.aerifal.cx \
    --to=dalias@libc.org \
    --cc=musl@lists.openwall.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).