Updating Unicode support

mailing list of musl libc

* Updating Unicode support
@ 2018-01-23  1:54 Eric Pruitt
  2018-01-23 23:38 ` Rich Felker
  0 siblings, 1 reply; 10+ messages in thread
From: Eric Pruitt @ 2018-01-23  1:54 UTC
  To: musl

NOTE: When I first started writing this email, I didn't realize musl's
Unicode property table had recently been updated, but I noticed
<https://git.musl-libc.org/cgit/musl/commit/?id=c72c1c5> when I was
looking up commit IDs to cite. I'm leaving most of the verbiage below
unchanged since I think it adds useful context.

The Unicode property data used by musl has not been updated in quite
some time, and due to changes introduced in recent publications of the
Unicode standard, musl's width data is incorrect for many symbols --
notably emoji. This can lead to rendering glitches in terminals when
some applications are not built with musl; for example, my terminal
emulator is dynamically linked against a version of GNU libc that
supports Unicode 9 (released June 21, 2016) whereas musl's table was
last updated in 2011 or 2012 (commit 1b0ce9a).

To resolve this problem, I wrote a drop-in replacement for musl's
wcwidth(3) implementation that uses utf8proc
(https://github.com/JuliaLang/utf8proc) as the source of truth. You can
find the code for this at
<https://github.com/ericpruitt/static-unix-userland/blob/42cbdbb/utf8proc-wcwidth/utf8proc-wcwidth.c>.
I am wondering if the musl developers would consider accepting a patch
that implements optional / configurable support for utf8proc. The
utf8proc-wcwidth.c file I linked to includes some additional code
unrelated to musl making it possible to use the file as an LD_PRELOAD
library. The LD_PRELOAD stuff would **not** be included in the proposed
patch. I'm also investigating implementing the Unicode Collation
Algorithm (https://unicode.org/reports/tr10/) for wcscoll(3); would that
be of interest?

Thanks,
Eric


* Re: Updating Unicode support
  2018-01-23  1:54 Updating Unicode support Eric Pruitt
@ 2018-01-23 23:38 ` Rich Felker
  2018-01-24  0:51   ` Eric Pruitt
  0 siblings, 1 reply; 10+ messages in thread
From: Rich Felker @ 2018-01-23 23:38 UTC
  To: musl

On Mon, Jan 22, 2018 at 05:54:49PM -0800, Eric Pruitt wrote:
> NOTE: When I first started writing this email, I didn't realize musl's
> Unicode property table had recently been updated, but I noticed
> <https://git.musl-libc.org/cgit/musl/commit/?id=c72c1c5> when I was
> looking up commit IDs to cite. I'm leaving most of the verbiage below
> unchanged since I think it adds useful context.

OK. With this in mind, I hope you're also aware that musl's Unicode
tables are all highly optimized for size and (aside from case mapping)
very good speed relative to their size, and are generated mechanically
from the UCD files via some ugly code here:

https://github.com/richfelker/musl-chartable-tools

If someone wants to make local changes or upgrade to newer Unicode
before it's upstream in musl, these tools generally provide the best
way to do it.

> The Unicode property data used by musl has not been updated in quite
> some time, and due to changes introduced in recent publications of the
> Unicode standard, musl's width data is incorrect for many symbols --
> notably emoji.

If you mean that emoji should be considered double-width, I agree with
that in principle, but everything has to *agree* upon widths in order
for them to work. If not, terminal contents just get corrupted when
programs or systems that disagree try to communicate. It would take a
coordinated effort with glibc, third-party libraries, and programs
like screen that ship their own wcwidth-equivalent tables to redefine
them as double-width, and ideally there should probably be some
Unicode recommendation to document the change.

Note that musl assumes all characters that aren't already defined as
control or nonspacing are single-width, except in the extended CJK
planes where it assumes they're double-width, so lack of support for
latest Unicode is only a problem when new nonspacing characters are
added or when wide characters are added outside the CJK planes.
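
A minimal sketch of the rule described above, for illustration only. It
is not musl's implementation: the range tables are hypothetical and
deliberately tiny (musl's real tables are generated from the UCD files),
the name toy_wcwidth is made up, and reading "extended CJK planes" as
planes 2 and 3 is an assumption.

```c
#include <stddef.h>
#include <stdint.h>

struct cp_range { uint32_t first, last; };

/* Hypothetical, incomplete tables; real ones are generated from the UCD. */
static const struct cp_range toy_nonspacing[] = {
    { 0x0300, 0x036F },  /* combining diacritical marks */
    { 0x200B, 0x200F },  /* zero-width and directional format characters */
};

static const struct cp_range toy_wide[] = {
    { 0x1100, 0x115F },  /* Hangul Jamo initial consonants */
    { 0x4E00, 0x9FFF },  /* CJK unified ideographs */
    { 0xAC00, 0xD7A3 },  /* Hangul syllables */
};

static int in_ranges(const struct cp_range *r, size_t n, uint32_t c)
{
    for (size_t i = 0; i < n; i++)
        if (c >= r[i].first && c <= r[i].last)
            return 1;
    return 0;
}

/* Returns -1 for control characters, 0 for NUL and nonspacing
 * characters, 2 for wide characters, and 1 for everything else. */
int toy_wcwidth(uint32_t c)
{
    if (c == 0)
        return 0;
    if (c < 0x20 || (c >= 0x7F && c < 0xA0))
        return -1;
    if (in_ranges(toy_nonspacing, sizeof toy_nonspacing / sizeof toy_nonspacing[0], c))
        return 0;
    if (in_ranges(toy_wide, sizeof toy_wide / sizeof toy_wide[0], c))
        return 2;
    if (c >= 0x20000 && c <= 0x3FFFD)  /* assumed meaning of "extended CJK planes" */
        return 2;
    return 1;  /* default: anything not listed is single-width */
}
```

With tables like these, a code point such as U+1F600 falls through to
the single-width default, which is the kind of case where an older table
and a newer one can disagree.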

> This can lead to rendering glitches in terminals when
> some applications are not built with musl; for example, my terminal
> emulator is dynamically linked against a version of GNU libc that
> supports Unicode 9 (released June 21, 2016) whereas musl's table was
> last updated in 2011 or 2012 (commit 1b0ce9a).

Do you have an example of characters that caused the problem? I'd like
to better understand how it came up. Maybe glibc is already doing
something different than what I think they're doing.

> To resolve this problem, I wrote a drop-in replacement for musl's
> wcwidth(3) implementation that uses utf8proc
> (https://github.com/JuliaLang/utf8proc) as the source of truth. You can
> find the code for this at
> <https://github.com/ericpruitt/static-unix-userland/blob/42cbdbb/utf8proc-wcwidth/utf8proc-wcwidth.c>.
> I am wondering if the musl developers would consider accepting a patch
> that implements optional / configurable support for utf8proc. The
> utf8proc-wcwidth.c file I linked to includes some additional code
> unrelated to musl making it possible to use the file as an LD_PRELOAD
> library. The LD_PRELOAD stuff would **not** be included in the proposed
> patch. I'm also investigating implementing the Unicode Collation
> Algorithm (https://unicode.org/reports/tr10/) for wcscoll(3); would that
> be of interest?

Thanks for pointing out this library -- it looks like something we
might want to add to the wiki as a recommended lib, and seems to
implement a lot of Unicode functionality that's otherwise only
available in gigantic bloated libraries like ICU. I'd like to take a
closer look at it when I get time.

Of course it's possible to drop it in to musl's tree locally like you
did as a hack, but this isn't something musl can really do due to both
namespace considerations (wcwidth depending on symbols not in reserved
namespace) and policy about not introducing config switches. But if
the table contents in utf8proc do differ from musl, you can always use
the chartable tools package to generate matching tables to drop into
musl.

Rich


* Re: Updating Unicode support
  2018-01-23 23:38 ` Rich Felker
@ 2018-01-24  0:51   ` Eric Pruitt
  2018-01-24  6:26     ` Eric Pruitt
  0 siblings, 1 reply; 10+ messages in thread
From: Eric Pruitt @ 2018-01-24  0:51 UTC
  To: musl

On Tue, Jan 23, 2018 at 06:38:57PM -0500, Rich Felker wrote:
> OK. With this in mind, I hope you're also aware that musl's Unicode
> tables are all highly optimized for size and (aside from case mapping)
> very good speed relative to their size, and are generated mechanically
> from the UCD files via some ugly code here:
>
> https://github.com/richfelker/musl-chartable-tools

The utf8proc library also uses optimized tables for property lookups.
For example, retrieving properties for an individual character is done
using a 2-stage lookup:

    // utf8proc.c:223 at commit 3a10df6
    static const utf8proc_property_t *unsafe_get_property(utf8proc_int32_t uc) {
      /* ASSERT: uc >= 0 && uc < 0x110000 */
      return utf8proc_properties + (
        utf8proc_stage2table[
          utf8proc_stage1table[uc >> 8] + (uc & 0xFF)
        ]
      );
    }

See <https://github.com/JuliaLang/utf8proc/tree/95fc75b/data> for the
gory details. It's on my TODO list to compare the size of the object
files generated using utf8proc with those generated from musl's built-in
tables. I'll post the results once I get around to it. It's not an issue
for me
personally because I don't use musl on any resource constrained systems,
but I do appreciate and understand that this is a priority for you which
is why I suggested making utf8proc an optional feature.
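
To make the two-stage idea concrete, here is a toy version with 16 code
points and blocks of 4 instead of 256. All names and table contents are
made up for illustration; only the indexing pattern mirrors the utf8proc
snippet above. The saving comes from identical blocks sharing a single
copy in the second-stage table.

```c
#include <stdint.h>

enum { TOY_BLOCK_BITS = 2, TOY_BLOCK_SIZE = 1 << TOY_BLOCK_BITS, TOY_LIMIT = 16 };

/* Property records; here each "property" is just a width. */
static const int toy_properties[] = { 1, 0, 2 };

/* Stage 2: deduplicated blocks of indices into toy_properties. */
static const uint8_t toy_stage2[] = {
    0, 0, 0, 0,  /* offset 0: a block where every entry is width 1 */
    0, 1, 1, 0,  /* offset 4: a block containing two width-0 entries */
    2, 2, 2, 2,  /* offset 8: a block where every entry is width 2 */
};

/* Stage 1: for each block of code points, the offset of its data in
 * stage 2. Blocks 0 and 2 are identical, so both point at offset 0. */
static const uint8_t toy_stage1[] = { 0, 4, 0, 8 };

/* Returns the toy width of c, or -1 if c is outside the toy range. */
int toy_lookup(unsigned c)
{
    if (c >= TOY_LIMIT)
        return -1;
    return toy_properties[
        toy_stage2[toy_stage1[c >> TOY_BLOCK_BITS] + (c & (TOY_BLOCK_SIZE - 1))]
    ];
}
```

For example, toy_lookup(5) selects block 1 (offset 4), adds the
in-block position 1, reads property index 1 from stage 2, and returns
width 0.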

> If you mean that emoji should be considered double-width, I agree with
> that in principle, but everything has to *agree* upon widths in order
> for them to work. If not, terminal contents just get corrupted when
> programs or systems that disagree try to communicate. It would take a
> coordinated effort with glibc, third-party libraries, and programs
> like screen that ship their own wcwidth-equivalent tables to redefine
> them as double-width, and ideally there should probably be some
> Unicode recommendation to document the change.

Hence the ability to compile the utf8proc-wcwidth.c as a shared library
that can be used with LD_PRELOAD. Initially I thought everything would
work out once all my applications used the same Unicode release, but I
still noticed inconsistencies and rendering glitches. The final solution
was using LD_PRELOAD to override wcwidth(3) and wcswidth(3) in
applications that either I don't build myself (notably Mutt and M.O.C)
or that I dynamically link -- currently just my graphical terminal
emulator simply because I have no interest in trying to statically link
against X11.

My other frequently used CLI applications like Bash, GNU Awk, and tmux
are compiled statically using musl libc with my utf8proc changes. Long
story short, I control the entire rendering stack by building
applications I care about myself or using LD_PRELOAD to bend the ones I
don't to my will. I don't think I've had any rendering problems since I
started doing things this way.
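
A rough sketch of what such an override can look like, assuming a
dynamically linked program on Linux. It is not the code from
utf8proc-wcwidth.c: shim_lookup_width is a hypothetical stand-in for
the real table lookup (utf8proc in the setup described above), and its
emoji range is a toy.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical stand-in for the real width source. */
static int shim_lookup_width(wchar_t wc)
{
    if (wc == 0)
        return 0;
    if (wc < 0x20 || (wc >= 0x7F && wc < 0xA0))
        return -1;                       /* control characters */
    if (wc >= 0x1F300 && wc <= 0x1F64F)
        return 2;                        /* toy emoji range, illustration only */
    return 1;
}

/* Exported under the libc name so a preloaded library takes precedence. */
int wcwidth(wchar_t wc)
{
    return shim_lookup_width(wc);
}

/* Sums the widths of at most n wide characters, stopping at the
 * terminator; returns -1 if any character is non-printable. */
int wcswidth(const wchar_t *s, size_t n)
{
    int total = 0;
    for (; n > 0 && *s != L'\0'; s++, n--) {
        int w = wcwidth(*s);
        if (w < 0)
            return -1;
        total += w;
    }
    return total;
}
```

Built as a shared object (for instance with "cc -shared -fPIC -o
libwidth-shim.so shim.c", file names hypothetical) and loaded with
LD_PRELOAD=./libwidth-shim.so, these definitions should be found before
the libc ones in dynamically linked programs. Statically linked
programs are unaffected, which is why those need the change compiled
in.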

> Do you have an example of characters that caused the problem? I'd like
> to better understand how it came up. Maybe glibc is already doing
> something different than what I think they're doing.

I'll follow up on this later. I need to recompile a few things before I
can give you some concrete examples. I wrote a program for an unrelated
project that I can use to compare the width data of glibc, musl libc and
my utf8proc-based wcwidth(3), and I'll include that, too.
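
A comparison of this kind can be sketched as below. This is not the
program mentioned above; the function names and the two toy width
functions are made up for illustration. It walks a code point range and
reports where two width functions disagree.

```c
#include <stdio.h>
#include <wchar.h>

typedef int (*width_fn)(wchar_t);

/* Toy implementation A: every code point is single-width. */
int toy_narrow_everywhere(wchar_t wc)
{
    (void)wc;
    return 1;
}

/* Toy implementation B: a made-up emoji range is double-width. */
int toy_emoji_wide(wchar_t wc)
{
    return (wc >= 0x1F300 && wc <= 0x1F64F) ? 2 : 1;
}

/* Writes one line per code point in [first, last] where a and b
 * disagree (if out is non-NULL) and returns the number of mismatches. */
long report_width_mismatches(FILE *out, width_fn a, width_fn b,
                             unsigned long first, unsigned long last)
{
    long mismatches = 0;
    for (unsigned long c = first; c <= last; c++) {
        int wa = a((wchar_t)c);
        int wb = b((wchar_t)c);
        if (wa != wb) {
            if (out)
                fprintf(out, "U+%04lX: %d vs %d\n", c, wa, wb);
            mismatches++;
        }
    }
    return mismatches;
}
```

In practice one of the arguments would be the libc's own wcwidth,
called after setlocale(LC_CTYPE, "") in a UTF-8 locale. Since a process
links against only one libc, comparing glibc with musl would likely
mean dumping widths from each build to a file and diffing the results.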

> Thanks for pointing out this library -- it looks like something we
> might want to add to the wiki as a recommended lib, and seems to
> implement a lot of Unicode functionality that's otherwise only
> available in gigantic bloated libraries like ICU. I'd like to take a
> closer look at it when I get time.

I've been pretty happy with utf8proc so far. My only qualms with it are
the lack of pre-existing implementations of common POSIX functions and
the relatively heavy toolchain used to generate its property tables;
updating the property tables requires Julia, Ruby and FontForge. These
programs are readily available for popular Linux distributions, but
those applications aren't something I normally have installed on my
hosts.

I finished reviewing the Unicode Collation Algorithm, and it looks like
utf8proc doesn't include the necessary collation information. This is
understandable since different locales have different collation rules,
but I'm going to propose adding DUCET, the Default Unicode Collation
Element Table, on their issue tracker since it doesn't look like it's
been discussed yet.

> If someone wants to make local changes or upgrade to newer Unicode
> before it's upstream in musl, these tools generally provide the best
> way to do it.
>
> [...]
>
> Of course it's possible to drop it in to musl's tree locally like you
> did as a hack, but this isn't something musl can really do due to both
> namespace considerations (wcwidth depending on symbols not in reserved
> namespace) and policy about not introducing config switches. But if
> the table contents in utf8proc do differ from musl, you can always use
> the chartable tools package to generate matching tables to drop into
> musl.

Either I overlooked musl-chartable-tools when I was trying to figure out
how to update musl's Unicode tables or they hadn't been posted to the
wiki when I last checked. As mentioned above, I'll do some comparisons
and get back to you.

Eric


* Re: Updating Unicode support
  2018-01-24  0:51   ` Eric Pruitt
@ 2018-01-24  6:26     ` Eric Pruitt
  2018-01-24 21:45       ` Rich Felker
  2018-01-24 21:48       ` Rich Felker
  0 siblings, 2 replies; 10+ messages in thread
From: Eric Pruitt @ 2018-01-24  6:26 UTC
  To: musl

On Tue, Jan 23, 2018 at 04:51:33PM -0800, Eric Pruitt wrote:
> On Tue, Jan 23, 2018 at 06:38:57PM -0500, Rich Felker wrote:
> > OK. With this in mind, I hope you're also aware that musl's Unicode
> > tables are all highly optimized for size and (aside from case mapping)
> > very good speed relative to their size, and are generated mechanically
> > from the UCD files via some ugly code here:
> >
> > https://github.com/richfelker/musl-chartable-tools

I updated my copy of musl to 1.1.18 then recompiled it with and without
my utf8proc changes using GCC 6.3.0 "-O3" targeting Linux 4.9.0 /
x86_64:

- Original implementation: 2,762,774B (musl-1.1.18/lib/libc.a)
- utf8proc implementation: 3,055,954B (musl-1.1.18/lib/libc.a)
- The utf8proc implementation is ~11% larger. I didn't do any
  performance comparisons.

> > Do you have an example of characters that caused the problem? I'd like
> > to better understand how it came up. Maybe glibc is already doing
> > something different than what I think they're doing.
>
> I'll follow up on this later. I need to recompile a few things before I
> can give you some concrete examples. I wrote a program for an unrelated
> project that I can use to compare the width data of glibc, musl libc and
> my utf8proc-based wcwidth(3), and I'll include that, too.
>
> [...]
>
> Either I overlooked musl-chartable-tools when I was trying to figure out
> how to update musl's Unicode tables or they hadn't been posted to the
> wiki when I last checked. As mentioned above, I'll do some comparisons
> and get back to you.

I'm using Debian 9, and the version of glibc it ships with (2.24) uses
Unicode 9. Since musl-1.1.18 uses Unicode 10 data, I'll have to rebuild
the character tables to do proper comparisons. The text files in
musl-chartable-tools appear to be out of date:

    data$ head -n5 *.txt
    ==> DerivedCoreProperties.txt <==
    # DerivedCoreProperties-6.1.0.txt
    # Date: 2011-12-11, 18:26:55 GMT [MD]
    #
    # Unicode Character Database
    # Copyright (c) 1991-2011 Unicode, Inc.

    ==> EastAsianWidth.txt <==
    # EastAsianWidth-6.1.0.txt
    # Date: 2011-09-19, 18:46:00 GMT [KW]
    #
    # East Asian Width Properties
    #

I know the updated versions of the text files can be downloaded from
<https://www.unicode.org/Public/10.0.0/ucd/>. Could you please verify
whether the version of the code that was used to create
<https://git.musl-libc.org/cgit/musl/commit/?id=c72c1c5> and
<https://git.musl-libc.org/cgit/musl/commit/?id=54941ed> has been pushed
to <https://github.com/richfelker/musl-chartable-tools>?

> I finished reviewing the Unicode Collation Algorithm, and it looks like
> utf8proc doesn't include the necessary collation information. This is
> understandable since different locales have different collation rules,
> but I'm going to propose adding DUCET, the Default Unicode Collation
> Element Table, on their issue tracker since it doesn't look like it's
> been discussed yet.

I opened https://github.com/JuliaLang/utf8proc earlier today.

Eric



* Re: Updating Unicode support
  2018-01-24  6:26     ` Eric Pruitt
@ 2018-01-24 21:45       ` Rich Felker
  2018-01-24 22:22         ` Eric Pruitt
  2018-01-24 21:48       ` Rich Felker
  1 sibling, 1 reply; 10+ messages in thread
From: Rich Felker @ 2018-01-24 21:45 UTC
  To: musl

On Tue, Jan 23, 2018 at 10:26:02PM -0800, Eric Pruitt wrote:
> On Tue, Jan 23, 2018 at 04:51:33PM -0800, Eric Pruitt wrote:
> > On Tue, Jan 23, 2018 at 06:38:57PM -0500, Rich Felker wrote:
> > > OK. With this in mind, I hope you're also aware that musl's Unicode
> > > tables are all highly optimized for size and (aside from case mapping)
> > > very good speed relative to their size, and are generated mechanically
> > > from the UCD files via some ugly code here:
> > >
> > > https://github.com/richfelker/musl-chartable-tools
> 
> I updated my copy of musl to 1.1.18 then recompiled it with and without
> my utf8proc changes using GCC 6.3.0 "-O3" targeting Linux 4.9.0 /
> x86_64:
> 
> - Original implementation: 2,762,774B (musl-1.1.18/lib/libc.a)
> - utf8proc implementation: 3,055,954B (musl-1.1.18/lib/libc.a)
> - The utf8proc implementation is ~11% larger. I didn't do any
>   performance comparisons.
> 
> > > Do you have an example of characters that caused the problem? I'd like
> > > to better understand how it came up. Maybe glibc is already doing
> > > something different than what I think they're doing.
> >
> > I'll follow up on this later. I need to recompile a few things before I
> > can give you some concrete examples. I wrote a program for an unrelated
> > project that I can use to compare the width data of glibc, musl libc and
> > my utf8proc-based wcwidth(3), and I'll include that, too.
> >
> > [...]
> >
> > Either I overlooked musl-chartable-tools when I was trying to figure out
> > how to update musl's Unicode tables or they hadn't been posted to the
> > wiki when I last checked. As mentioned above, I'll do some comparisons
> > and get back to you.
> 
> I'm using Debian 9, and the version of glibc it ships with (2.24) uses
> Unicode 9. Since musl-1.1.18 uses Unicode 10 data, I'll have to rebuild
> the character tables to do proper comparisons. The text files in
> musl-chartable-tools appear to be out of date:
> 
>     data$ head -n5 *.txt
>     ==> DerivedCoreProperties.txt <==
>     # DerivedCoreProperties-6.1.0.txt
>     # Date: 2011-12-11, 18:26:55 GMT [MD]
>     #
>     # Unicode Character Database
>     # Copyright (c) 1991-2011 Unicode, Inc.
> 
>     ==> EastAsianWidth.txt <==
>     # EastAsianWidth-6.1.0.txt
>     # Date: 2011-09-19, 18:46:00 GMT [KW]
>     #
>     # East Asian Width Properties
>     #
> 
> I know the updated versions of the text files can be downloaded from
> <https://www.unicode.org/Public/10.0.0/ucd/>. Could you please verify
> whether the version of the code that was used to create
> <https://git.musl-libc.org/cgit/musl/commit/?id=c72c1c5> and
> <https://git.musl-libc.org/cgit/musl/commit/?id=54941ed> has been pushed
> to <https://github.com/richfelker/musl-chartable-tools>?

Indeed, it wasn't pushed -- sorry. Done now.

> > I finished reviewing the Unicode Collation Algorithm, and it looks like
> > utf8proc doesn't include the necessary collation information. This is
> > understandable since different locales have different collation rules,
> > but I'm going to propose adding DUCET, the Default Unicode Collation
> > Element Table, on their issue tracker since it doesn't look like it's
> > been discussed yet.
> 
> I opened https://github.com/JuliaLang/utf8proc earlier today.

You mentioned it earlier, and yes, collation is also an open problem
for musl. I want to do it based on UCA, not the POSIX localedef form
of collation tables.

Rich



* Re: Updating Unicode support
  2018-01-24  6:26     ` Eric Pruitt
  2018-01-24 21:45       ` Rich Felker
@ 2018-01-24 21:48       ` Rich Felker
  2018-01-24 22:25         ` Eric Pruitt
  1 sibling, 1 reply; 10+ messages in thread
From: Rich Felker @ 2018-01-24 21:48 UTC
  To: musl

On Tue, Jan 23, 2018 at 10:26:02PM -0800, Eric Pruitt wrote:
> On Tue, Jan 23, 2018 at 04:51:33PM -0800, Eric Pruitt wrote:
> > On Tue, Jan 23, 2018 at 06:38:57PM -0500, Rich Felker wrote:
> > > OK. With this in mind, I hope you're also aware that musl's Unicode
> > > tables are all highly optimized for size and (aside from case mapping)
> > > very good speed relative to their size, and are generated mechanically
> > > from the UCD files via some ugly code here:
> > >
> > > https://github.com/richfelker/musl-chartable-tools
> 
> I updated my copy of musl to 1.1.18 then recompiled it with and without
> my utf8proc changes using GCC 6.3.0 "-O3" targeting Linux 4.9.0 /
> x86_64:
> 
> - Original implementation: 2,762,774B (musl-1.1.18/lib/libc.a)
> - utf8proc implementation: 3,055,954B (musl-1.1.18/lib/libc.a)
> - The utf8proc implementation is ~11% larger. I didn't do any
>   performance comparisons.

You're comparing the whole library, not character tables. If you
compare against all of ctype, it's a 15x size increase. If you compare
against just wcwidth, it's a 69x increase.

Rich



* Re: Updating Unicode support
  2018-01-24 21:45       ` Rich Felker
@ 2018-01-24 22:22         ` Eric Pruitt
  0 siblings, 0 replies; 10+ messages in thread
From: Eric Pruitt @ 2018-01-24 22:22 UTC
  To: musl

On Wed, Jan 24, 2018 at 04:45:39PM -0500, Rich Felker wrote:
> > > I finished reviewing the Unicode Collation Algorithm, and it looks like
> > > utf8proc doesn't include the necessary collation information. This is
> > > understandable since different locales have different collation rules,
> > > but I'm going to propose adding DUCET, the Default Unicode Collation
> > > Element Table, on their issue tracker since it doesn't look like it's
> > > been discussed yet.
> >
> > I opened https://github.com/JuliaLang/utf8proc earlier today.
>
> You mentioned it earlier, and yes, collation is also an open problem
> for musl. I want to do it based on UCA, not the POSIX localedef form
> of collation tables.

DUCET is part of the UCA specification.

Eric



* Re: Updating Unicode support
  2018-01-24 21:48       ` Rich Felker
@ 2018-01-24 22:25         ` Eric Pruitt
  2018-01-24 22:53           ` Eric Pruitt
  0 siblings, 1 reply; 10+ messages in thread
From: Eric Pruitt @ 2018-01-24 22:25 UTC
  To: musl

On Wed, Jan 24, 2018 at 04:48:53PM -0500, Rich Felker wrote:
> > I updated my copy of musl to 1.1.18 then recompiled it with and without
> > my utf8proc changes using GCC 6.3.0 "-O3" targeting Linux 4.9.0 /
> > x86_64:
> >
> > - Original implementation: 2,762,774B (musl-1.1.18/lib/libc.a)
> > - utf8proc implementation: 3,055,954B (musl-1.1.18/lib/libc.a)
> > - The utf8proc implementation is ~11% larger. I didn't do any
> >   performance comparisons.
>
> You're comparing the whole library, not character tables. If you
> compare against all of ctype, it's a 15x size increase. If you compare
> against just wcwidth, it's a 69x increase.

That was intentional. I have no clue what the common case is for other
people that use musl, but most applications **I** use make use of
various parts of musl, so I did the comparison on the library as a
whole.

Eric



* Re: Updating Unicode support
  2018-01-24 22:25         ` Eric Pruitt
@ 2018-01-24 22:53           ` Eric Pruitt
  2018-01-24 23:32             ` Rich Felker
  0 siblings, 1 reply; 10+ messages in thread
From: Eric Pruitt @ 2018-01-24 22:53 UTC
  To: musl

On Wed, Jan 24, 2018 at 02:25:06PM -0800, Eric Pruitt wrote:
> On Wed, Jan 24, 2018 at 04:48:53PM -0500, Rich Felker wrote:
> > > I updated my copy of musl to 1.1.18 then recompiled it with and without
> > > my utf8proc changes using GCC 6.3.0 "-O3" targeting Linux 4.9.0 /
> > > x86_64:
> > >
> > > - Original implementation: 2,762,774B (musl-1.1.18/lib/libc.a)
> > > - utf8proc implementation: 3,055,954B (musl-1.1.18/lib/libc.a)
> > > - The utf8proc implementation is ~11% larger. I didn't do any
> > >   performance comparisons.
> >
> > You're comparing the whole library, not character tables. If you
> > compare against all of ctype, it's a 15x size increase. If you compare
> > against just wcwidth, it's a 69x increase.
>
> That was intentional. I have no clue what the common case is for other
> people that use musl, but most applications **I** use make use of
> various parts of musl, so I did the comparison on the library as a
> whole.

If the size of utf8proc tables is a problem, I'm not sure how you'd go
about implementing UCA without them in an efficient manner. Part of the
UCA requires normalizing the Unicode strings and also needs character
property data to determine what sequence of characters in one string is
compared to a sequence of characters in another string. Perhaps you
could compromise by simply ignoring certain characters and not doing
normalization at all.

Since the utf8proc maintainer seems receptive to my proposed change, I'm
going to implement the collation feature in utf8proc, and if you decide
that utf8proc is worth the bloat, you'll get collation logic for "free."

Eric



* Re: Updating Unicode support
  2018-01-24 22:53           ` Eric Pruitt
@ 2018-01-24 23:32             ` Rich Felker
  0 siblings, 0 replies; 10+ messages in thread
From: Rich Felker @ 2018-01-24 23:32 UTC
  To: musl

On Wed, Jan 24, 2018 at 02:53:18PM -0800, Eric Pruitt wrote:
> On Wed, Jan 24, 2018 at 02:25:06PM -0800, Eric Pruitt wrote:
> > On Wed, Jan 24, 2018 at 04:48:53PM -0500, Rich Felker wrote:
> > > > I updated my copy of musl to 1.1.18 then recompiled it with and without
> > > > my utf8proc changes using GCC 6.3.0 "-O3" targeting Linux 4.9.0 /
> > > > x86_64:
> > > >
> > > > - Original implementation: 2,762,774B (musl-1.1.18/lib/libc.a)
> > > > - utf8proc implementation: 3,055,954B (musl-1.1.18/lib/libc.a)
> > > > - The utf8proc implementation is ~11% larger. I didn't do any
> > > >   performance comparisons.
> > >
> > > You're comparing the whole library, not character tables. If you
> > > compare against all of ctype, it's a 15x size increase. If you compare
> > > against just wcwidth, it's a 69x increase.
> >
> > That was intentional. I have no clue what the common case is for other
> > people that use musl, but most applications **I** use make use of
> > various parts of musl, so I did the comparison on the library as a
> > whole.
> 
> If the size of utf8proc tables is a problem, I'm not sure how you'd go
> about implementing UCA without them in an efficient manner.

I don't think this is actually a productive discussion because the
metrics we're looking at aren't really meaningful. The libc.a size
doesn't tell you anything about how much of the code actually gets
linked when you use wcwidth, etc. Sorry, I should have noticed that
earlier.

> Part of the
> UCA requires normalizing the Unicode strings and also needs character
> property data to determine what sequence of characters in one string is
> compared to a sequence of characters in another string. Perhaps you
> could compromise by simply ignoring certain characters and not doing
> normalization at all.

There's currently nothing in libc that depends on any sort of
normalization, but IDN support (which has a patch pending) is related
and may need normalization to meet user expectations. If so, doing
normalization efficiently is a relevant problem for musl.

I forget if UCA actually needs normalization or not. I seem to
remember working out that you could just expand the collation tables
(mechanically at locale generation time) to account for variations in
composed/decomposed forms so that no normalization phase would be
necessary at runtime, but I may have been mistaken. It's been a long
time since I looked at it.
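
The table-expansion idea can be illustrated with a toy, under heavy
assumptions: the weights, the table, and every name below are made up,
and this is nowhere near a conforming UCA implementation. The table
carries an entry for precomposed U+00E9 and a mechanically added entry
for the decomposed sequence U+0065 U+0301 with the same weights, so a
longest-match lookup treats both spellings alike without a runtime
normalization pass.

```c
#include <stddef.h>
#include <stdint.h>

/* One toy collation element: a code point sequence and its weights. */
struct toy_entry {
    uint32_t seq[2];
    size_t len;
    int primary;
    int secondary;
};

static const struct toy_entry toy_table[] = {
    { { 0x0065, 0 },      1, 10, 1 },  /* e */
    { { 0x00E9, 0 },      1, 10, 2 },  /* e with acute, precomposed */
    { { 0x0065, 0x0301 }, 2, 10, 2 },  /* e + combining acute, added by expansion */
    { { 0x0066, 0 },      1, 11, 1 },  /* f */
};

#define TOY_COUNT (sizeof toy_table / sizeof toy_table[0])

/* Returns the longest table entry matching the start of s, or NULL. */
static const struct toy_entry *toy_match(const uint32_t *s, size_t n)
{
    const struct toy_entry *best = NULL;
    for (size_t i = 0; i < TOY_COUNT; i++) {
        const struct toy_entry *e = &toy_table[i];
        size_t k = 0;
        if (e->len > n)
            continue;
        while (k < e->len && e->seq[k] == s[k])
            k++;
        if (k == e->len && (!best || e->len > best->len))
            best = e;
    }
    return best;
}

/* Collects weights for one level (0 = primary, 1 = secondary).
 * Code points missing from the toy table are skipped. */
static size_t toy_weights(const uint32_t *s, size_t n, int level, int *out, size_t cap)
{
    size_t count = 0;
    for (size_t i = 0; i < n && count < cap; ) {
        const struct toy_entry *e = toy_match(s + i, n - i);
        if (!e) {
            i++;
            continue;
        }
        out[count++] = level == 0 ? e->primary : e->secondary;
        i += e->len;
    }
    return count;
}

static int toy_compare_level(const uint32_t *a, size_t na,
                             const uint32_t *b, size_t nb, int level)
{
    int wa[16], wb[16];
    size_t ca = toy_weights(a, na, level, wa, 16);
    size_t cb = toy_weights(b, nb, level, wb, 16);
    for (size_t i = 0; i < ca && i < cb; i++)
        if (wa[i] != wb[i])
            return wa[i] < wb[i] ? -1 : 1;
    return (ca > cb) - (ca < cb);
}

/* Compares primary weights first, then secondary weights as a tiebreak. */
int toy_collate(const uint32_t *a, size_t na, const uint32_t *b, size_t nb)
{
    int r = toy_compare_level(a, na, b, nb, 0);
    return r ? r : toy_compare_level(a, na, b, nb, 1);
}
```

Whether this kind of expansion covers every case the UCA requires is
exactly the open question raised above; the toy only shows the
mechanism for a single pair of equivalent spellings.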

> Since the utf8proc maintainer seems receptive to my proposed change, I'm
> going to implement the collation feature in utf8proc, and if you decide

For sure.

> that utf8proc is worth the bloat, you'll get collation logic for "free."

That's not even the question, because we can't use outside libraries
directly. We could import the code, though in the past that's been a
really bad idea (see TRE), or we could borrow the ideas/data structures
behind it. It was probably a mistake on my part to bring up the
size/efficiency topic to begin with, since it's not the core point, but
I did want to emphasize that the implementations we have were designed
not just to be simple and fairly small, but to be really small in the
sense of being hard to make any smaller.

Rich

