From mboxrd@z Thu Jan 1 00:00:00 1970 X-Msuck: nntp://news.gmane.org/gmane.linux.lib.musl.general/12587 Path: news.gmane.org!.POSTED!not-for-mail From: Rich Felker Newsgroups: gmane.linux.lib.musl.general Subject: Efficient case mapping Date: Tue, 6 Mar 2018 15:27:41 -0500 Message-ID: <20180306202741.GK1436@brightrain.aerifal.cx> Reply-To: musl@lists.openwall.com NNTP-Posting-Host: blaine.gmane.org Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Trace: blaine.gmane.org 1520367962 32476 195.159.176.226 (6 Mar 2018 20:26:02 GMT) X-Complaints-To: usenet@blaine.gmane.org NNTP-Posting-Date: Tue, 6 Mar 2018 20:26:02 +0000 (UTC) User-Agent: Mutt/1.5.21 (2010-09-15) To: musl@lists.openwall.com Original-X-From: musl-return-12603-gllmg-musl=m.gmane.org@lists.openwall.com Tue Mar 06 21:25:58 2018 Return-path: Envelope-to: gllmg-musl@m.gmane.org Original-Received: from mother.openwall.net ([195.42.179.200]) by blaine.gmane.org with smtp (Exim 4.84_2) (envelope-from ) id 1etJ9i-00075l-Ht for gllmg-musl@m.gmane.org; Tue, 06 Mar 2018 21:25:50 +0100 Original-Received: (qmail 11491 invoked by uid 550); 6 Mar 2018 20:27:54 -0000 Mailing-List: contact musl-help@lists.openwall.com; run by ezmlm Precedence: bulk List-Post: List-Help: List-Unsubscribe: List-Subscribe: List-ID: Original-Received: (qmail 11455 invoked from network); 6 Mar 2018 20:27:54 -0000 Content-Disposition: inline Original-Sender: Rich Felker Xref: news.gmane.org gmane.linux.lib.musl.general:12587 Archived-At: Replacing the compact but slow linear-search-based case mapping tables with something comparably compact but fast has been on the hidden roadmap for a while now, and I think I have a good solution: Analysis of Unicode blocks (256-char) shows that most blocks have either no case mappings at all, or just a single common delta between lower and upper case for the whole block. There are about 11 blocks that break this rule and have more. The most unique deltas with nontrivial frequency (same delta used by more than a single-digit number of characters) in any block is 3, appearing only in block 04. As such, a single-direction case mapping can be achieved as follows. Use two multi-level bit tables (same format as ctype bits) to assign a 2-bit number to each codepoint, and 2 integer deltas (the two most common) to each block. The meanings are: 00 - no case mapping 01 - map via block's first delta 10 - map via block's second delta 11 - exception; binary search tiny table of exceptions Obviously this can be extended to two directions via a second pair of tables, but I think we can do better, just 3 bits total: 000 - no case mappings 001 - is lowercase; map to uppercase via block's first toupper delta 010 - is uppercase; map to lowercase via block's first tolower delta 011 - exception; binary search tiny table of exceptions 100 - is lowercase; map to uppercase via block's second toupper delta 101 - is lowercase; map to uppercase via block's third toupper delta 110 - is uppercase; map to lowercase via block's second tolower delta 111 - is uppercase; map to lowercase via block's third tolower delta Most blocks will use no bits at all (no case mappings). A few blocks that only contain lowercase with a single delta will use just one bit. Many blocks with case mappings will use 2 bits (only one frequent delta for each direction). The remaining few will use 3 bits, and now they can code not just 2 but 3 frequent deltas in each direction, getting rid of a lot of exceptions. Some quick size estimates (hope I didn't mess this up): 11 complex blocks -- 11 blocks * 256/8 bytes * 3 bits 7 simple blocks -- 7 blocks * 256/8 bytes * 1 bit Top-level tables -- 512 blocks * 1 byte * 3 tables Exceptions -- ~96 pairs * 4 bytes * 2 sortings Looks like it comes to about 3k total. Further optimizing: We're not using all the data we already have. Only alphabetic characters can have case mappings, and we already have an iswalpha table; the above tables are highly correlated with it. If we short-circuit out the whole operation with iswalpha, then the "000 - no case mappings" value is no longer common. Instead, "zero is one of the common deltas" can be included or not on a per-block basis. We can also add, along with the per-block common deltas, a table of which cases each delta goes with, and assign new bit meanings: 00 - is case[0] for block, map to ~case[0] via delta[0] 01 - is case[1] for block, map to ~case[1] via delta[1] 10 - is case[2] for block, map to ~case[2] via delta[2] 11 - is exception, map via binary search of exceptions table If needed we could add back a third bit here, but I think we can get by without it. Note that, because iswalpha is now short-circuiting non-alphabetic characters, we can just assign an arbitrary table value (best: 00) to all non-alphabetic (including unassigned) characters. This drops most of the "simple blocks" (blocks with just 0 or a single delta) down to trivial (elided) blocks. These last ideas ("further optimizing:" and beyond) were worked out while writing this email, and I don't yet have any analysis/test code to go with them. Will follow up later with results. With some luck, final version will be barely-larger than current code, extensible, and fast. :) Rich