From mboxrd@z Thu Jan  1 00:00:00 1970
X-Msuck: nntp://news.gmane.org/gmane.linux.lib.musl.general/12391
Path: news.gmane.org!.POSTED!not-for-mail
From: Rich Felker <dalias@libc.org>
Newsgroups: gmane.linux.lib.musl.general
Subject: Re: Updating Unicode support
Date: Wed, 24 Jan 2018 18:32:55 -0500
Message-ID: <20180124233255.GA1627@brightrain.aerifal.cx>
References: <20180123015446.vera7ocpvgaqvkss@sinister.lan.codevat.com>
 <20180123233857.GW1627@brightrain.aerifal.cx>
 <20180124005133.pdcypbus23yrikgg@sinister.lan.codevat.com>
 <20180124062602.3nn7xiwo4mgor57y@sinister.lan.codevat.com>
 <20180124214853.GZ1627@brightrain.aerifal.cx>
 <20180124222506.vrr6vmi5pbsxojvb@sinister.lan.codevat.com>
 <20180124225318.edwuzdu53c7f2sts@sinister.lan.codevat.com>
Reply-To: musl@lists.openwall.com
NNTP-Posting-Host: blaine.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Trace: blaine.gmane.org 1516836679 21321 195.159.176.226 (24 Jan 2018 23:31:19 GMT)
X-Complaints-To: usenet@blaine.gmane.org
NNTP-Posting-Date: Wed, 24 Jan 2018 23:31:19 +0000 (UTC)
User-Agent: Mutt/1.5.21 (2010-09-15)
To: musl@lists.openwall.com
Original-X-From: musl-return-12407-gllmg-musl=m.gmane.org@lists.openwall.com Thu Jan 25 00:31:15 2018
Return-path: <musl-return-12407-gllmg-musl=m.gmane.org@lists.openwall.com>
Envelope-to: gllmg-musl@m.gmane.org
Original-Received: from mother.openwall.net ([195.42.179.200])
	by blaine.gmane.org with smtp (Exim 4.84_2)
	(envelope-from <musl-return-12407-gllmg-musl=m.gmane.org@lists.openwall.com>)
	id 1eeUVW-0004o0-A1
	for gllmg-musl@m.gmane.org; Thu, 25 Jan 2018 00:31:06 +0100
Original-Received: (qmail 30323 invoked by uid 550); 24 Jan 2018 23:33:08 -0000
Mailing-List: contact musl-help@lists.openwall.com; run by ezmlm
Precedence: bulk
List-Post: <mailto:musl@lists.openwall.com>
List-Help: <mailto:musl-help@lists.openwall.com>
List-Unsubscribe: <mailto:musl-unsubscribe@lists.openwall.com>
List-Subscribe: <mailto:musl-subscribe@lists.openwall.com>
List-ID: <musl.lists.openwall.com>
Original-Received: (qmail 30301 invoked from network); 24 Jan 2018 23:33:07 -0000
Content-Disposition: inline
In-Reply-To: <20180124225318.edwuzdu53c7f2sts@sinister.lan.codevat.com>
Original-Sender: Rich Felker <dalias@aerifal.cx>
Xref: news.gmane.org gmane.linux.lib.musl.general:12391
Archived-At: <http://permalink.gmane.org/gmane.linux.lib.musl.general/12391>

On Wed, Jan 24, 2018 at 02:53:18PM -0800, Eric Pruitt wrote:
> On Wed, Jan 24, 2018 at 02:25:06PM -0800, Eric Pruitt wrote:
> > On Wed, Jan 24, 2018 at 04:48:53PM -0500, Rich Felker wrote:
> > > > I updated my copy of musl to 1.1.18 then recompiled it with and without
> > > > my utf8proc changes using GCC 6.3.0 "-O3" targeting Linux 4.9.0 /
> > > > x86_64:
> > > >
> > > > - Original implementation: 2,762,774B (musl-1.1.18/lib/libc.a)
> > > > - utf8proc implementation: 3,055,954B (musl-1.1.18/lib/libc.a)
> > > > - The utf8proc implementation is ~11% larger. I didn't do any
> > > >   performance comparisons.
> > >
> > > You're comparing the whole library, not character tables. If you
> > > compare against all of ctype, it's a 15x size increase. If you compare
> > > against just wcwidth, it's a 69x increase.
> >
> > That was intentional. I have no clue what the common case is for other
> > people that use musl, but most applications **I** use make use of
> > various parts of musl, so I did the comparison on the library as a
> > whole.
> 
> If the size of utf8proc tables is a problem, I'm not sure how you'd go
> about implementing UCA without them in an efficient manner.

I don't think this is actually a productive discussion because the
metrics we're looking at aren't really meaningful. The libc.a size
doesn't tell you anything about how much of the code actually gets
linked when you use wcwidth, etc. Sorry, I should have noticed that
earlier.

> Part of the
> UCA requires normalizing the Unicode strings and also needs character
> property data to determine what sequence of characters in one string is
> compared to a sequence of characters in another string. Perhaps you
> could compromise by simply ignoring certain characters and not doing
> normalization at all.

There's currently nothing in libc that depends on any sort of
normalization, but IDN support (which has a patch pending) is related
and may need normalization to meet user expectations. If so, doing
normalization efficiently is a relevant problem for musl.

I forget whether UCA actually needs normalization. I seem to
remember working out that you could just expand the collation tables
(mechanically at locale generation time) to account for variations in
composed/decomposed forms so that no normalization phase would be
necessary at runtime, but I may have been mistaken. It's been a long
time since I looked at it.
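
To make the table-expansion idea concrete, here is a toy sketch in C. It
is purely illustrative and is not musl code. The names (toy_table,
next_weight, toy_collate) and the weights are invented for the example,
and real UCA uses multi-level weights from DUCET rather than a single
number. The assumption is that a locale generator has already emitted,
next to each precomposed character, its decomposed spelling with the
same weight, so the comparison loop never normalizes anything. Whether
this is sufficient for full UCA is exactly the open question above.

```c
#include <stddef.h>
#include <stdint.h>

/* One toy collation element: a sequence of one or two code points and a
 * single weight. The weights are made up; they are not DUCET values. */
struct coll_entry {
    uint32_t seq[2];   /* seq[1] == 0 means a single code point */
    uint32_t weight;
};

/* Table as a (hypothetical) generator might emit it: the precomposed
 * character and its decomposed spelling carry the same weight. */
static const struct coll_entry toy_table[] = {
    { { 'e',  0     }, 10 },
    { { 0xE9, 0     }, 11 },  /* U+00E9, e with acute */
    { { 'e',  0x301 }, 11 },  /* e followed by U+0301 combining acute */
    { { 'f',  0     }, 20 },
};

/* Return the weight of the longest table entry matching at s, which must
 * point at a nonzero code point in a zero-terminated string. The number
 * of code points consumed is stored in *len. */
static uint32_t next_weight(const uint32_t *s, size_t *len)
{
    uint32_t best = 0;
    size_t best_len = 0;

    for (size_t i = 0; i < sizeof toy_table / sizeof toy_table[0]; i++) {
        const struct coll_entry *e = &toy_table[i];
        size_t n = e->seq[1] ? 2 : 1;
        if (s[0] != e->seq[0]) continue;
        if (n == 2 && s[1] != e->seq[1]) continue;
        if (n > best_len) { best = e->weight; best_len = n; }
    }
    if (!best_len) {           /* not in the table: arbitrary fallback */
        best = 1000 + s[0];
        best_len = 1;
    }
    *len = best_len;
    return best;
}

/* Compare two zero-terminated code point strings by weight, with no
 * normalization pass. Returns <0, 0, or >0 like strcmp. */
int toy_collate(const uint32_t *a, const uint32_t *b)
{
    while (*a && *b) {
        size_t la, lb;
        uint32_t wa = next_weight(a, &la);
        uint32_t wb = next_weight(b, &lb);
        if (wa != wb) return wa < wb ? -1 : 1;
        a += la;
        b += lb;
    }
    return (*a != 0) - (*b != 0);
}
```

With this table, toy_collate() treats { 0xE9, 0 } and { 'e', 0x301, 0 }
as equal, and sorts both after plain 'e' and before 'f'. The cost is a
larger table, paid once at generation time rather than on every
comparison.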

> Since the utf8proc maintainer seems receptive to my proposed change, I'm
> going to implement the collation feature in utf8proc, and if you decide

For sure.

> that utf8proc is worth the bloat, you'll get collation logic for "free."

That's not even the question, because we can't use outside libraries
directly. We could import code, though in the past that's been a really
bad idea (see TRE), or we could borrow the ideas/data structures behind
the code. It was probably a mistake on my part to bring up the
size/efficiency topic to begin with, since it's not the core point, but
I did want to emphasize that the implementations we have were designed
not just to be simple and fairly small, but to be really small, in the
sense that it's hard to make anything smaller.

Rich