mailing list of musl libc
From: Rich Felker <dalias@libc.org>
To: musl@lists.openwall.com
Subject: Re: Updating Unicode support
Date: Wed, 24 Jan 2018 16:45:39 -0500
Message-ID: <20180124214539.GY1627@brightrain.aerifal.cx>
In-Reply-To: <20180124062602.3nn7xiwo4mgor57y@sinister.lan.codevat.com>

On Tue, Jan 23, 2018 at 10:26:02PM -0800, Eric Pruitt wrote:
> On Tue, Jan 23, 2018 at 04:51:33PM -0800, Eric Pruitt wrote:
> > On Tue, Jan 23, 2018 at 06:38:57PM -0500, Rich Felker wrote:
> > > OK. With this in mind, I hope you're also aware that musl's Unicode
> > > tables are all highly optimized for size and (aside from case mapping)
> > > very good speed relative to their size, and are generated mechanically
> > > from the UCD files via some ugly code here:
> > >
> > > https://github.com/richfelker/musl-chartable-tools
> 
> I updated my copy of musl to 1.1.18, then recompiled it with and without
> my utf8proc changes using GCC 6.3.0 with "-O3", targeting Linux 4.9.0 /
> x86_64:
> 
> - Original implementation: 2,762,774B (musl-1.1.18/lib/libc.a)
> - utf8proc implementation: 3,055,954B (musl-1.1.18/lib/libc.a)
> - The utf8proc implementation is ~11% larger. I didn't do any
>   performance comparisons.
> 
> > > Do you have an example of characters that caused the problem? I'd like
> > > to better understand how it came up. Maybe glibc is already doing
> > > something different than what I think they're doing.
> >
> > I'll follow up on this later. I need to recompile a few things before I
> > can give you some concrete examples. I wrote a program for an unrelated
> > project that I can use to compare the width data of glibc, musl libc and
> > my utf8proc-based wcwidth(3), and I'll include that, too.
> >
> > [...]
> >
> > Either I overlooked musl-chartable-tools when I was trying to figure out
> > how to update musl's Unicode tables or they hadn't been posted to the
> > wiki when I last checked. As mentioned above, I'll do some comparisons
> > and get back to you.
> 
> I'm using Debian 9, and the version of glibc it ships with (2.24) uses
> Unicode 9. Since musl-1.1.18 uses Unicode 10 data, I'll have to rebuild
> the character tables to do proper comparisons. The text files in
> musl-chartable-tools appear to be out of date:
> 
>     data$ head -n5 *.txt
>     ==> DerivedCoreProperties.txt <==
>     # DerivedCoreProperties-6.1.0.txt
>     # Date: 2011-12-11, 18:26:55 GMT [MD]
>     #
>     # Unicode Character Database
>     # Copyright (c) 1991-2011 Unicode, Inc.
> 
>     ==> EastAsianWidth.txt <==
>     # EastAsianWidth-6.1.0.txt
>     # Date: 2011-09-19, 18:46:00 GMT [KW]
>     #
>     # East Asian Width Properties
>     #
> 
> I know the updated versions of the text files can be downloaded from
> <https://www.unicode.org/Public/10.0.0/ucd/>. Could you please verify
> whether the version of the code that was used to create
> <https://git.musl-libc.org/cgit/musl/commit/?id=c72c1c5> and
> <https://git.musl-libc.org/cgit/musl/commit/?id=54941ed> has been pushed
> to <https://github.com/richfelker/musl-chartable-tools>?

Indeed, it wasn't pushed -- sorry. Done now.

> > I finished reviewing the Unicode Collation Algorithm, and it looks like
> > utf8proc doesn't include the necessary collation information. This is
> > understandable since different locales have different collation rules,
> > but I'm going to propose adding DUCET, the Default Unicode Collation
> > Element Table, on their issue tracker since it doesn't look like it's
> > been discussed yet.
> 
> I opened https://github.com/JuliaLang/utf8proc earlier today.

You mentioned it earlier, and yes, collation is also an open problem
for musl. I want to do it based on UCA, not the POSIX localedef form
of collation tables.

Rich
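
For anyone checking which Unicode version a local copy of the UCD text
files corresponds to (the quoted `head -n5` output shows 6.1.0 headers),
here is a minimal sketch in Python. It is not part of
musl-chartable-tools, which is written in C; the function names
`ucd_version` and `parse_ranges` are hypothetical, and the two data lines
in `sample` are illustrative rather than copied from any particular
release. It assumes the usual UCD layout: a first line of the form
`# Name-X.Y.Z.txt`, then `codepoint-or-range;property # comment` lines.

```python
import re
from typing import Iterable, Iterator, Optional, Tuple

# Matches the first line of a UCD data file, e.g. "# EastAsianWidth-6.1.0.txt".
_HEADER = re.compile(r"^#\s*[A-Za-z]+-(\d+\.\d+\.\d+)\.txt\s*$")


def ucd_version(first_line: str) -> Optional[str]:
    """Return the Unicode version named in a UCD header line, or None."""
    m = _HEADER.match(first_line)
    return m.group(1) if m else None


def parse_ranges(lines: Iterable[str]) -> Iterator[Tuple[int, int, str]]:
    """Yield (first, last, property) for each data line of a property file."""
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue  # blank line or comment-only line
        fields = [f.strip() for f in data.split(";")]
        codepoints, prop = fields[0], fields[1]
        # A field is either a single code point or "FIRST..LAST".
        first, _, last = codepoints.partition("..")
        yield int(first, 16), int(last or first, 16), prop


# Illustrative input only (hypothetical excerpt in the UCD line format).
sample = """\
# EastAsianWidth-6.1.0.txt
# Date: 2011-09-19, 18:46:00 GMT [KW]
0020;Na # Zs SPACE
1100..115F;W # Lo [96] HANGUL CHOSEONG KIYEOK..HANGUL CHOSEONG FILLER
""".splitlines()

print(ucd_version(sample[0]))
for first, last, prop in parse_ranges(sample):
    print(f"{first:04X}..{last:04X} {prop}")
```

Running `ucd_version` on the first line of each file in the data directory
is enough to tell whether the inputs match the Unicode version the tables
are meant to be built from.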

