From: Eric Pruitt
Newsgroups: gmane.linux.lib.musl.general
Subject: Re: Updating Unicode support
Date: Tue, 23 Jan 2018 16:51:33 -0800
Message-ID: <20180124005133.pdcypbus23yrikgg@sinister.lan.codevat.com>
References: <20180123015446.vera7ocpvgaqvkss@sinister.lan.codevat.com> <20180123233857.GW1627@brightrain.aerifal.cx>
In-Reply-To: <20180123233857.GW1627@brightrain.aerifal.cx>
Reply-To: musl@lists.openwall.com
To: musl@lists.openwall.com
PGP-Key: https://www.codevat.com/pgp.asc#F8601B5D2511B4C3535232488DDDE2E6053692AB

On Tue, Jan 23, 2018 at 06:38:57PM -0500, Rich Felker wrote:
> OK. With this in mind, I hope you're also aware that musl's Unicode
> tables are all highly optimized for size and (aside from case mapping)
> very good speed relative to their size, and are generated mechanically
> from the UCD files via some ugly code here:
>
> https://github.com/richfelker/musl-chartable-tools

The utf8proc library also uses optimized tables for property lookups.
For example, retrieving properties for an individual character is done
using a 2-stage lookup:

    // utf8proc.c:223 at commit 3a10df6
    static const utf8proc_property_t *unsafe_get_property(utf8proc_int32_t uc) {
      /* ASSERT: uc >= 0 && uc < 0x110000 */
      return utf8proc_properties + (
        utf8proc_stage2table[
          utf8proc_stage1table[uc >> 8] + (uc & 0xFF)
        ]
      );
    }

See for the gory details. It's on my TODO list to compare the size of
the object files generated using utf8proc with the size of musl's
built-in tables. I'll post the results once I get around to it. It's not
an issue for me personally because I don't use musl on any
resource-constrained systems, but I do appreciate and understand that
this is a priority for you, which is why I suggested making utf8proc an
optional feature.

> If you mean that emoji should be considered double-width, I agree with
> that in principle, but everything has to *agree* upon widths in order
> for them to work. If not, terminal contents just get corrupted when
> programs or systems that disagree try to communicate. It would take a
> coordinated effort with glibc, third-party libraries, and programs
> like screen that ship their own wcwidth-equivalent tables to redefine
> them as double-width, and ideally there should probably be some
> Unicode recommendation to document the change.

Hence the ability to compile utf8proc-wcwidth.c as a shared library
that can be used with LD_PRELOAD. Initially I thought everything would
work out once all my applications used the same Unicode release, but I
still noticed inconsistencies and rendering glitches. The final solution
was using LD_PRELOAD to override wcwidth(3) and wcswidth(3) in
applications that either I don't build myself (notably Mutt and M.O.C)
or that I dynamically link -- currently just my graphical terminal
emulator, simply because I have no interest in trying to statically link
against X11.
My other frequently used CLI applications like Bash, GNU Awk, and tmux
are compiled statically using musl libc with my utf8proc changes. Long
story short, I control the entire rendering stack by building
applications I care about myself or using LD_PRELOAD to bend the ones I
don't to my will. I don't think I've had any rendering problems since I
started doing things this way.

> Do you have an example of characters that caused the problem? I'd like
> to better understand how it came up. Maybe glibc is already doing
> something different than what I think they're doing.

I'll follow up on this later. I need to recompile a few things before I
can give you some concrete examples. I wrote a program for an unrelated
project that I can use to compare the width data of glibc, musl libc and
my utf8proc-based wcwidth(3), and I'll include that, too.

> Thanks for pointing out this library -- it looks like something we
> might should add to the wiki as a recommended lib, and seems to
> implement a lot of Unicode functionality that's otherwise only
> available in gigantic bloated libraries like ICU. I'd like to take a
> closer look at it when I get time.

I've been pretty happy with utf8proc so far. My only qualms with it are
the lack of pre-existing implementations of common POSIX functions and
the relatively heavy toolchain used to generate its property tables;
updating the property tables requires Julia, Ruby and FontForge. These
programs are readily available for popular Linux distributions, but they
aren't something I normally have installed on my hosts.

I finished reviewing the Unicode Collation Algorithm, and it looks like
utf8proc doesn't include the necessary collation information. This is
understandable since different locales have different collation rules,
but I'm going to propose adding DUCET, the Default Unicode Collation
Element Table, on their issue tracker since it doesn't look like it's
been discussed yet.

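To make the 2-stage lookup quoted earlier from utf8proc.c a bit more
concrete, here is a minimal sketch of the same indexing scheme on a toy
data set. Every toy_* name and all of the table contents are hypothetical
and made up for illustration; this is not utf8proc's or musl's actual
data, and the range check is an addition that unsafe_get_property leaves
to its caller.

```c
#include <stdint.h>

/* Toy two-stage lookup. All toy_* names and table contents are
 * hypothetical; they are not taken from utf8proc or musl. */

enum toy_category { TOY_OTHER = 0, TOY_DIGIT = 1 };

struct toy_property {
    enum toy_category category;
};

/* Shared property records; stage 2 stores indices into this array. */
static const struct toy_property toy_properties[] = {
    { TOY_OTHER },
    { TOY_DIGIT },
};

#define TOY_LIMIT 0x400 /* the toy tables cover U+0000..U+03FF only */

/* Stage 1: one entry per 256-code-point block, giving that block's
 * offset into stage 2. Blocks 1..3 are identical, so they all share
 * offset 256 instead of being stored three times. */
static const uint16_t toy_stage1[TOY_LIMIT >> 8] = { 0, 256, 256, 256 };

/* Stage 2: the deduplicated blocks laid end to end. Entries not listed
 * here are zero-initialized, so they refer to toy_properties[0]. */
static const uint8_t toy_stage2[2 * 256] = {
    ['0'] = 1, ['1'] = 1, ['2'] = 1, ['3'] = 1, ['4'] = 1,
    ['5'] = 1, ['6'] = 1, ['7'] = 1, ['8'] = 1, ['9'] = 1,
};

/* High bits pick the block, low 8 bits pick the entry within it. */
static const struct toy_property *toy_get_property(int32_t cp)
{
    if (cp < 0 || cp >= TOY_LIMIT)
        return &toy_properties[0];
    return &toy_properties[toy_stage2[toy_stage1[cp >> 8] + (cp & 0xFF)]];
}
```

With these toy tables, toy_get_property('7') resolves to the TOY_DIGIT
record, while toy_get_property(0x137) has the same low byte but lands in
the shared all-zero block and resolves to TOY_OTHER. A flat table for the
same range would need 1024 entries; here stage 2 holds 512 entries plus
four stage-1 offsets, and the saving grows as more blocks turn out to be
identical.
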
> If someone wants to make local changes or upgrade to newer Unicode
> before it's upstream in musl, these tools generally provide the best
> way to do it.
>
> [...]
>
> Of course it's possible to drop it in to musl's tree locally like you
> did as a hack, but this isn't something musl can really do due to both
> namespace considerations (wcwidth depending on symbols not in reserved
> namespace) and policy about not introducing config switches. But if
> the table contents in utf8proc do differ from musl, you can always use
> the chartable tools package to generate matching tables to drop into
> musl.

Either I overlooked musl-chartable-tools when I was trying to figure out
how to update musl's Unicode tables or they hadn't been posted to the
wiki when I last checked. As mentioned above, I'll do some comparisons
and get back to you.

Eric