From mboxrd@z Thu Jan  1 00:00:00 1970
X-Msuck: nntp://news.gmane.org/gmane.linux.lib.musl.general/12379
Path: news.gmane.org!.POSTED!not-for-mail
From: Eric Pruitt <eric.pruitt@gmail.com>
Newsgroups: gmane.linux.lib.musl.general
Subject: Re: Updating Unicode support
Date: Tue, 23 Jan 2018 16:51:33 -0800
Message-ID: <20180124005133.pdcypbus23yrikgg@sinister.lan.codevat.com>
References: <20180123015446.vera7ocpvgaqvkss@sinister.lan.codevat.com>
 <20180123233857.GW1627@brightrain.aerifal.cx>
Reply-To: musl@lists.openwall.com
NNTP-Posting-Host: blaine.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Trace: blaine.gmane.org 1516755002 27625 195.159.176.226 (24 Jan 2018 00:50:02 GMT)
X-Complaints-To: usenet@blaine.gmane.org
NNTP-Posting-Date: Wed, 24 Jan 2018 00:50:02 +0000 (UTC)
User-Agent: NeoMutt/20170113 (1.7.2)
To: musl@lists.openwall.com
Original-X-From: musl-return-12395-gllmg-musl=m.gmane.org@lists.openwall.com Wed Jan 24 01:49:58 2018
Return-path: <musl-return-12395-gllmg-musl=m.gmane.org@lists.openwall.com>
Envelope-to: gllmg-musl@m.gmane.org
Original-Received: from mother.openwall.net ([195.42.179.200])
	by blaine.gmane.org with smtp (Exim 4.84_2)
	(envelope-from <musl-return-12395-gllmg-musl=m.gmane.org@lists.openwall.com>)
	id 1ee9G8-0006P6-R8
	for gllmg-musl@m.gmane.org; Wed, 24 Jan 2018 01:49:48 +0100
Original-Received: (qmail 31949 invoked by uid 550); 24 Jan 2018 00:51:48 -0000
Mailing-List: contact musl-help@lists.openwall.com; run by ezmlm
Precedence: bulk
List-Post: <mailto:musl@lists.openwall.com>
List-Help: <mailto:musl-help@lists.openwall.com>
List-Unsubscribe: <mailto:musl-unsubscribe@lists.openwall.com>
List-Subscribe: <mailto:musl-subscribe@lists.openwall.com>
List-ID: <musl.lists.openwall.com>
Original-Received: (qmail 31931 invoked from network); 24 Jan 2018 00:51:48 -0000
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20161025;
        h=date:from:to:subject:message-id:references:mime-version
         :content-disposition:in-reply-to:pgp-key:user-agent;
        bh=YsBdxZECI6ErA+EyEeNgCREpQqhNaab9ZJufr1dzrFw=;
        b=IeYF+Zjff4EAbgyYQ6IIVWalpi/f6boe2T+03zc3brMKYffYoKb9w8+4dsQKkfQ0Ni
         jzpOC6DXi5TIpdqXBPWZsgaTGtHqhq7w/d9Gpb2fjhd+cb357Em1OWZHE6MDoeQpa+71
         cZqB7wrxU8wP4jJFfdwrcfD/fBpJA04CVI9kI+6RN1/9yUcrICDWOR0ZgPWZNNFmGZID
         SkPW2Oqi44lzCnR/yZ4GJAKPtab3pIwT2uVppc6M256S+Ws+Jp+YML2iBswqXI3bxwiN
         eQG9KyL3y9QGktPKj2gBBbfg9be07EHDlN2Bgz8dSo4th2UrD4Zu/i48TUu2+EDwnxXl
         4CsA==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:date:from:to:subject:message-id:references
         :mime-version:content-disposition:in-reply-to:pgp-key:user-agent;
        bh=YsBdxZECI6ErA+EyEeNgCREpQqhNaab9ZJufr1dzrFw=;
        b=OmcN921c5Lo6Y09jutJeWU/APU+XRTpXUm86cSjew8iu+vZeQ+6ZGQXhktvAADQeZs
         raW7vMumL6DCBPhI1tKTdn3wKHFg86xHIgaC1MMoMUbZ2O3JCbv7teWM3ZWxd2DteUao
         BQLe3O99HrARDoX/Ck83eq7e9EMrxz4I4hbQ6De6Rzg6Bgaz0eOEXbimKdG7QFuhLUYt
         XHtHIPG2IUQMXKmxz6GmaIe3XzLVbWlReQRdmn3bRHRlHOYKh/m/ZMmzhYPGaq1NC/ar
         JMcu29KfOzas93i5X6hChhPFnauEdlgaXSb2xNmtxz/RVGMRFJ8YotTlJ4LxzC6cWmja
         h2gA==
X-Gm-Message-State: AKwxytctGIblg4k7DN+XGUELeglsXDBOvMNn+SMZjwm0I3BHz0xzdou7
	QLADY42K5eoOg9d0h320kd3T2Q==
X-Google-Smtp-Source: AH8x227wvRn+wJ5kgJ90yxAnNCQrpiU5ShQuXcpJAg5/mr1YkuPAvKCaKw9K7hUzFUVx2qAtDKlkXA==
X-Received: by 10.99.123.8 with SMTP id w8mr9484264pgc.201.1516755095547;
        Tue, 23 Jan 2018 16:51:35 -0800 (PST)
Content-Disposition: inline
In-Reply-To: <20180123233857.GW1627@brightrain.aerifal.cx>
PGP-Key: https://www.codevat.com/pgp.asc#F8601B5D2511B4C3535232488DDDE2E6053692AB
Xref: news.gmane.org gmane.linux.lib.musl.general:12379
Archived-At: <http://permalink.gmane.org/gmane.linux.lib.musl.general/12379>

On Tue, Jan 23, 2018 at 06:38:57PM -0500, Rich Felker wrote:
> OK. With this in mind, I hope you're also aware that musl's Unicode
> tables are all highly optimized for size and (aside from case mapping)
> very good speed relative to their size, and are generated mechanically
> from the UCD files via some ugly code here:
>
> https://github.com/richfelker/musl-chartable-tools

The utf8proc library also uses optimized tables for property lookups.
For example, retrieving properties for an individual character is done
using a 2-stage lookup:

    // utf8proc.c:223 at commit 3a10df6
    static const utf8proc_property_t *unsafe_get_property(utf8proc_int32_t uc) {
      /* ASSERT: uc >= 0 && uc < 0x110000 */
      return utf8proc_properties + (
        utf8proc_stage2table[
          utf8proc_stage1table[uc >> 8] + (uc & 0xFF)
        ]
      );
    }

See <https://github.com/JuliaLang/utf8proc/tree/95fc75b/data> for the
gory details. It's on my TODO list to compare the size of the object
files generated from utf8proc with the size of musl's built-in tables.
I'll post the results once I get around to it. Size isn't an issue for
me personally because I don't use musl on any resource-constrained
systems, but I do understand and appreciate that it is a priority for
you, which is why I suggested making utf8proc an optional feature.
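
To make the two-stage scheme concrete, here is a minimal sketch with toy
data. Everything prefixed with toy_ is hypothetical and is not taken
from utf8proc: the real tables are generated offline as static const
arrays, while this sketch fills them at run time, and it marks only
U+4E00..U+9FFF as wide. The point is that identical 256-entry blocks are
stored once in stage 2, and stage 1 holds just one offset per block.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical toy tables; NOT utf8proc's real data or layout. */
enum { BLOCK_BITS = 8, BLOCK_SIZE = 1 << BLOCK_BITS, MAX_CP = 0x110000 };

typedef struct {
    const char *category;
    int width;
} toy_property_t;

/* One record per distinct property combination. */
static const toy_property_t toy_properties[] = {
    { "narrow", 1 },
    { "wide",   2 },
};

/* Stage 2 stores each distinct block once: block 0 is all "narrow",
 * block 1 is all "wide". Entries are indexes into toy_properties. */
static uint8_t toy_stage2[2 * BLOCK_SIZE];

/* Stage 1 maps (code point >> 8) to that block's offset in stage 2. */
static uint16_t toy_stage1[MAX_CP >> BLOCK_BITS];

static void toy_init(void)
{
    for (int i = 0; i < BLOCK_SIZE; i++) {
        toy_stage2[i] = 0;
        toy_stage2[BLOCK_SIZE + i] = 1;
    }
    /* Every block shares the "narrow" block by default... */
    for (int b = 0; b < (MAX_CP >> BLOCK_BITS); b++)
        toy_stage1[b] = 0;
    /* ...except U+4E00..U+9FFF, which share the "wide" block. */
    for (int b = 0x4E; b <= 0x9F; b++)
        toy_stage1[b] = BLOCK_SIZE;
}

static const toy_property_t *toy_get_property(int32_t uc)
{
    assert(uc >= 0 && uc < MAX_CP);
    return toy_properties +
        toy_stage2[toy_stage1[uc >> BLOCK_BITS] + (uc & (BLOCK_SIZE - 1))];
}
```

After calling toy_init(), toy_get_property(0x4E2D)->width is 2 and
toy_get_property('A')->width is 1. A flat table would need one entry
for each of the 0x110000 code points; here stage 1 has 0x1100 entries
and stage 2 has 512.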

> If you mean that emoji should be considered double-width, I agree with
> that in principle, but everything has to *agree* upon widths in order
> for them to work. If not, terminal contents just get corrupted when
> programs or systems that disagree try to communicate. It would take a
> coordinated effort with glibc, third-party libraries, and programs
> like screen that ship their own wcwidth-equivalent tables to redefine
> them as double-width, and ideally there should probably be some
> Unicode recommendation to document the change.

Hence the ability to compile utf8proc-wcwidth.c as a shared library
that can be used with LD_PRELOAD. Initially I thought everything would
work out once all my applications used the same Unicode release, but I
still noticed inconsistencies and rendering glitches. The final solution
was using LD_PRELOAD to override wcwidth(3) and wcswidth(3) in
applications that I either don't build myself (notably Mutt and M.O.C)
or link dynamically -- currently just my graphical terminal emulator,
simply because I have no interest in trying to statically link against
X11.

My other frequently used CLI applications like Bash, GNU Awk, and tmux
are compiled statically using musl libc with my utf8proc changes. Long
story short, I control the entire rendering stack by building
applications I care about myself or using LD_PRELOAD to bend the ones I
don't to my will. I don't think I've had any rendering problems since I
started doing things this way.
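
For readers unfamiliar with the LD_PRELOAD approach, the following is a
minimal sketch of what such an override library could look like. It is
not the contents of utf8proc-wcwidth.c: toy_width() and its ranges are
hypothetical stand-ins for a real width table, and the sketch assumes a
platform where wchar_t holds a full Unicode code point.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

static_assert(sizeof(wchar_t) >= 4, "sketch assumes a 32-bit wchar_t");

/* Hypothetical stand-in for a real width table. */
static int toy_width(wchar_t wc)
{
    long c = (long)wc;

    if (c == 0)
        return 0;
    if (c < 0x20 || (c >= 0x7F && c < 0xA0))
        return -1;                      /* C0/C1 control characters */
    if (c >= 0x4E00 && c <= 0x9FFF)
        return 2;                       /* CJK Unified Ideographs */
    if (c >= 0x1F300 && c <= 0x1F64F)
        return 2;                       /* an emoji range, for illustration */
    return 1;
}

/* Same signatures as the libc functions, so a preloaded copy is
 * resolved first by dynamically linked programs. */
int wcwidth(wchar_t wc)
{
    return toy_width(wc);
}

int wcswidth(const wchar_t *s, size_t n)
{
    int total = 0;

    for (size_t i = 0; i < n && s[i] != L'\0'; i++) {
        int w = toy_width(s[i]);
        if (w < 0)
            return -1;                  /* non-printable character */
        total += w;
    }
    return total;
}
```

Assuming the file is saved as toy-wcwidth.c (a made-up name), it could
be built with "cc -shared -fPIC -o toy-wcwidth.so toy-wcwidth.c" and
applied with "LD_PRELOAD=./toy-wcwidth.so some-program". This only
affects dynamically linked programs, which is why statically linked
ones need the change in libc itself.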

> Do you have an example of characters that caused the problem? I'd like
> to better understand how it came up. Maybe glibc is already doing
> something different than what I think they're doing.

I'll follow up on this later. I need to recompile a few things before I
can give you some concrete examples. I wrote a program for an unrelated
project that I can use to compare the width data of glibc, musl libc and
my utf8proc-based wcwidth(3), and I'll include that, too.
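
One simple way to do such a comparison, sketched here as an assumption
and not as the program mentioned above, is to dump the width of every
code point in a range as text, build the same source against each libc,
and diff the outputs. dump_widths() is a hypothetical helper name.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdio.h>
#include <wchar.h>

/* Writes one "U+XXXX<TAB>width" line per code point in [first, last],
 * skipping surrogates, and returns the number of lines written. */
static long dump_widths(FILE *out, long first, long last)
{
    long count = 0;

    assert(first >= 0 && last <= 0x10FFFF && first <= last);
    for (long cp = first; cp <= last; cp++) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            continue;                   /* surrogates are not characters */
        fprintf(out, "U+%04lX\t%d\n", (unsigned long)cp,
                wcwidth((wchar_t)cp));
        count++;
    }
    return count;
}
```

A driver would typically call setlocale(LC_CTYPE, "") under a UTF-8
locale before dump_widths(stdout, 0, 0x10FFFF), since the widths
reported by some implementations may depend on the active locale.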

> Thanks for pointing out this library -- it looks like something we
> might should add to the wiki as a recommended lib, and seems to
> implement a lot of Unicode functionality that's otherwise only
> available in gigantic bloated libraries like ICU. I'd like to take a
> closer look at it when I get time.

I've been pretty happy with utf8proc so far. My only qualms with it are
the lack of pre-existing implementations of common POSIX functions and
the relatively heavy toolchain used to generate its property tables;
updating the tables requires Julia, Ruby and FontForge. These programs
are readily available for popular Linux distributions, but they aren't
something I normally have installed on my hosts.

I finished reviewing the Unicode Collation Algorithm, and it looks like
utf8proc doesn't include the necessary collation information. This is
understandable since different locales have different collation rules,
but I'm going to propose adding DUCET, the Default Unicode Collation
Element Table, on their issue tracker since it doesn't look like it's
been discussed yet.

> If someone wants to make local changes or upgrade to newer Unicode
> before it's upstream in musl, these tools generally provide the best
> way to do it.
>
> [...]
>
> Of course it's possible to drop it in to musl's tree locally like you
> did as a hack, but this isn't something musl can really do due to both
> namespace considerations (wcwidth depending on symbols not in reserved
> namespace) and policy about not introducing config switches. But if
> the table contents in utf8proc do differ from musl, you can always use
> the chartable tools package to generate matching tables to drop into
> musl.

Either I overlooked musl-chartable-tools when I was trying to figure out
how to update musl's Unicode tables or they hadn't been posted to the
wiki when I last checked. As mentioned above, I'll do some comparisons
and get back to you.

Eric