From mboxrd@z Thu Jan 1 00:00:00 1970
From: Eric Pruitt
Newsgroups: gmane.linux.lib.musl.general
Subject: Re: Updating Unicode support
Date: Wed, 24 Jan 2018 14:53:18 -0800
Message-ID: <20180124225318.edwuzdu53c7f2sts@sinister.lan.codevat.com>
References: <20180123015446.vera7ocpvgaqvkss@sinister.lan.codevat.com>
 <20180123233857.GW1627@brightrain.aerifal.cx>
 <20180124005133.pdcypbus23yrikgg@sinister.lan.codevat.com>
 <20180124062602.3nn7xiwo4mgor57y@sinister.lan.codevat.com>
 <20180124214853.GZ1627@brightrain.aerifal.cx>
 <20180124222506.vrr6vmi5pbsxojvb@sinister.lan.codevat.com>
In-Reply-To: <20180124222506.vrr6vmi5pbsxojvb@sinister.lan.codevat.com>
Reply-To: musl@lists.openwall.com
To: musl@lists.openwall.com

On Wed, Jan 24, 2018 at 02:25:06PM -0800, Eric Pruitt wrote:
> On Wed, Jan 24, 2018 at 04:48:53PM -0500, Rich Felker wrote:
> > > I updated my copy of musl to 1.1.18 then recompiled it with and without
> > > my utf8proc changes using GCC 6.3.0 "-O3" targeting Linux 4.9.0 /
> > > x86_64:
> > >
> > > - Original implementation: 2,762,774B (musl-1.1.18/lib/libc.a)
> > > - utf8proc implementation: 3,055,954B (musl-1.1.18/lib/libc.a)
> > > - The utf8proc implementation is ~11% larger. I didn't do any
> > >   performance comparisons.
> >
> > You're comparing the whole library, not character tables. If you
> > compare against all of ctype, it's a 15x size increase. If you compare
> > against just wcwidth, it's a 69x increase.
>
> That was intentional. I have no clue what the common case is for other
> people that use musl, but most applications **I** use make use of
> various parts of musl, so I did the comparison on the library as a
> whole.

If the size of utf8proc tables is a problem, I'm not sure how you'd go
about implementing UCA without them in an efficient manner. Part of the
UCA requires normalizing the Unicode strings and also needs character
property data to determine what sequence of characters in one string is
compared to a sequence of characters in another string. Perhaps you
could compromise by simply ignoring certain characters and not doing
normalization at all.

Since the utf8proc maintainer seems receptive to my proposed change, I'm
going to implement the collation feature in utf8proc, and if you decide
that utf8proc is worth the bloat, you'll get collation logic for "free."

Eric