From mboxrd@z Thu Jan 1 00:00:00 1970
From: Eric Pruitt
Newsgroups: gmane.linux.lib.musl.general
Subject: Re: Updating Unicode support
Date: Tue, 23 Jan 2018 22:26:02 -0800
Message-ID: <20180124062602.3nn7xiwo4mgor57y@sinister.lan.codevat.com>
References: <20180123015446.vera7ocpvgaqvkss@sinister.lan.codevat.com> <20180123233857.GW1627@brightrain.aerifal.cx> <20180124005133.pdcypbus23yrikgg@sinister.lan.codevat.com>
In-Reply-To: <20180124005133.pdcypbus23yrikgg@sinister.lan.codevat.com>
Reply-To: musl@lists.openwall.com
To: musl@lists.openwall.com
User-Agent: NeoMutt/20170113 (1.7.2)
PGP-Key: https://www.codevat.com/pgp.asc#F8601B5D2511B4C3535232488DDDE2E6053692AB

On Tue, Jan 23, 2018 at 04:51:33PM -0800, Eric Pruitt wrote:
> On Tue, Jan 23, 2018 at 06:38:57PM -0500, Rich Felker wrote:
> > OK.
> > With this in mind, I hope you're also aware that musl's Unicode
> > tables are all highly optimized for size and (aside from case mapping)
> > very good speed relative to their size, and are generated mechanically
> > from the UCD files via some ugly code here:
> >
> > https://github.com/richfelker/musl-chartable-tools

I updated my copy of musl to 1.1.18 then recompiled it with and without
my utf8proc changes using GCC 6.3.0 "-O3" targeting Linux 4.9.0 /
x86_64:

- Original implementation: 2,762,774B (musl-1.1.18/lib/libc.a)
- utf8proc implementation: 3,055,954B (musl-1.1.18/lib/libc.a)
- The utf8proc implementation is ~11% larger.

I didn't do any performance comparisons.

> > Do you have an example of characters that caused the problem? I'd like
> > to better understand how it came up. Maybe glibc is already doing
> > something different than what I think they're doing.
>
> I'll follow-up on this later. I need to recompile a few things before I
> can give you some concrete examples. I wrote a program for an unrelated
> project that I can use to compare the width data of glibc, musl libc and
> my utf8proc-based wcwidth(3), and I'll include that, too.
>
> [...]
>
> Either I overlooked musl-chartable-tools when I was trying to figure out
> how to update musl's Unicode tables or they hadn't been posted to the
> wiki when I last checked. As mentioned above, I'll do some comparisons
> and get back to you.

I'm using Debian 9, and the version of glibc it ships with (2.24) uses
Unicode 9. Since musl-1.1.18 uses Unicode 10 data, I'll have to rebuild
the character tables to do proper comparisons. The text files in
musl-chartable-tools appear to be out of date:

    data$ head -n5 *.txt
    ==> DerivedCoreProperties.txt <==
    # DerivedCoreProperties-6.1.0.txt
    # Date: 2011-12-11, 18:26:55 GMT [MD]
    #
    # Unicode Character Database
    # Copyright (c) 1991-2011 Unicode, Inc.

    ==> EastAsianWidth.txt <==
    # EastAsianWidth-6.1.0.txt
    # Date: 2011-09-19, 18:46:00 GMT [KW]
    #
    # East Asian Width Properties
    #

I know the updated versions of the text files can be downloaded from .
Could you please verify whether the version of the code that was used to
create and has been pushed to ?

> I finished reviewing the Unicode Collation Algorithm, and it looks like
> utf8proc doesn't include the necessary collation information. This is
> understandable since different locales have different collation rules,
> but I'm going to propose adding DUCET, the Default Unicode Collation
> Element Table, on their issue tracker since it doesn't look like it's
> been discussed yet.

I opened https://github.com/JuliaLang/utf8proc earlier today.

Eric
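
For anyone who wants to repeat the staleness check on the UCD text files
without reading the "head" output by eye, here is a minimal Python sketch.
It assumes only the first-line header format visible in the output above
("# Name-X.Y.Z.txt") and takes Unicode 10.0.0 as the target because that is
the version musl-1.1.18 is said to use. The names ucd_version and is_older
are hypothetical helpers made up for this illustration; they are not part
of musl-chartable-tools.

```python
import re

# Matches the first header line of a UCD data file, e.g.
# "# EastAsianWidth-6.1.0.txt" -> ("EastAsianWidth", "6.1.0").
# Assumption: the header always has the form "# <Name>-<X.Y.Z>.txt".
HEADER_RE = re.compile(r"^#\s*([A-Za-z]+)-(\d+\.\d+\.\d+)\.txt\s*$")


def ucd_version(first_line):
    """Return (name, version) parsed from a UCD header line, or None."""
    match = HEADER_RE.match(first_line)
    if match is None:
        return None
    return match.group(1), match.group(2)


def version_tuple(version):
    """Turn "6.1.0" into (6, 1, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def is_older(version, target):
    """True if dotted version string `version` precedes `target`."""
    return version_tuple(version) < version_tuple(target)


if __name__ == "__main__":
    # Header lines copied from the head output quoted in this message.
    headers = [
        "# DerivedCoreProperties-6.1.0.txt",
        "# EastAsianWidth-6.1.0.txt",
    ]
    for line in headers:
        name, version = ucd_version(line)
        status = "stale" if is_older(version, "10.0.0") else "current"
        print(name, version, status)
```

Comparing the versions as integer tuples rather than as strings matters
here, since a plain string comparison would sort "10.0.0" before "6.1.0".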