mailing list of musl libc
From: Rich Felker <dalias@libc.org>
To: musl@lists.openwall.com
Subject: Re: Build option to disable locale [was: Byte-based C locale, draft 1]
Date: Sun, 7 Jun 2015 20:33:15 -0400
Message-ID: <20150608003315.GD17573@brightrain.aerifal.cx>
In-Reply-To: <5574DAE7.8040101@gmx.de>

On Mon, Jun 08, 2015 at 01:59:35AM +0200, Harald Becker wrote:
> On 07.06.2015 02:24, Rich Felker wrote:
> >It's somewhat more clear what you're talking about, but I'm still not
> >sure what specific pieces of code you would want to omit from libc.so.
> >Which of the following would you want to remove or keep?
> 
> I did not look into all the details ...
> 
> In general: Keep the API, but add stubs with minimal operation or
> fail for non-C locales (etc.).
> 
> >- UTF-8 encoding and decoding
> 
> May be of use to keep, if on bare minimum.

This is roughly 3k of code, and is mandatory if you want to say you
"support UTF-8" at all. I'll note the other parts that fundamentally
depend on it.
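
For a sense of what the decoding half involves, here is a minimal
sketch of a strict UTF-8 decoder. It is an illustration only, not
musl's mbtowc/mbrtowc code; the name utf8_decode and its interface are
made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not a libc function.
 * Decodes one UTF-8 sequence from s (n bytes available), stores the
 * code point in *cp and returns the number of bytes consumed.
 * Returns 0 if the input is empty, truncated or malformed. */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    size_t len, i;
    uint32_t min, c;
    unsigned char b;

    if (n == 0) return 0;
    b = s[0];
    if (b < 0x80) { *cp = b; return 1; }           /* ASCII */

    if (b >= 0xc2 && b <= 0xdf)      { len = 2; min = 0x80;    c = b & 0x1f; }
    else if ((b & 0xf0) == 0xe0)     { len = 3; min = 0x800;   c = b & 0x0f; }
    else if (b >= 0xf0 && b <= 0xf4) { len = 4; min = 0x10000; c = b & 0x07; }
    else return 0;                   /* stray continuation or invalid lead byte */

    if (n < len) return 0;           /* truncated sequence */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xc0) != 0x80) return 0;       /* not a continuation byte */
        c = (c << 6) | (s[i] & 0x3f);
    }
    if (c < min) return 0;                         /* overlong encoding */
    if (c > 0x10ffff) return 0;                    /* beyond Unicode range */
    if (c >= 0xd800 && c <= 0xdfff) return 0;      /* UTF-16 surrogate */
    *cp = c;
    return len;
}
```

Calling it on the two bytes 0xc3 0xa9 yields U+00E9 and a length of 2.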

> >- Character properties
> > - Case mappings
> 
> Keep ASCII, map all non-ASCII to a single value.

I assume by "map to a single value" you mean uniform properties for
all non-ASCII Unicode characters, e.g. just printable but nothing
else. Case-mapping everything down to one character would not be a
good idea. :-)

Character properties are roughly 11k of code. Case mappings are 1k of
code.

Note that while some of the properties are arguably not very useful
(the wctype system does not give you enough information to do serious
text processing with them), without the wcwidth property, you cannot
properly display non-ASCII text on a terminal. So at least this one,
which takes 3k, is pretty critical to "UTF-8 support".
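
To illustrate why a width property is needed, here is a toy sketch. The
functions toy_width and toy_swidth are invented for this example, and
the table covers only a handful of ranges; it is nowhere near the real
wcwidth data, and it assumes the text has already been decoded to code
points.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy column-width lookup (hypothetical, NOT a complete table).
 * Returns -1 for control characters, 0 for the few combining marks
 * listed, 2 for the few wide ranges listed, and 1 otherwise. */
int toy_width(uint32_t cp)
{
    if (cp == 0) return 0;
    if (cp < 0x20 || (cp >= 0x7f && cp < 0xa0)) return -1; /* controls */
    if (cp >= 0x0300 && cp <= 0x036f) return 0;  /* combining diacritical marks */
    if ((cp >= 0x1100 && cp <= 0x115f) ||        /* Hangul Jamo initial consonants */
        (cp >= 0x4e00 && cp <= 0x9fff) ||        /* CJK unified ideographs */
        (cp >= 0xff01 && cp <= 0xff60))          /* fullwidth forms */
        return 2;
    return 1;
}

/* Sum of column widths for n code points, or -1 if any is a control. */
int toy_swidth(const uint32_t *s, size_t n)
{
    int total = 0;
    size_t i;
    for (i = 0; i < n; i++) {
        int w = toy_width(s[i]);
        if (w < 0) return -1;
        total += w;
    }
    return total;
}
```

With this, "e" followed by U+0301 occupies one column although it is
three bytes in UTF-8, and two CJK ideographs occupy four columns
although they are six bytes. Neither the byte count nor the character
count gives the number a terminal program needs.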

> >- Internal message translation (nl_langinfo strings, errors, etc.)
> > - Message translation API (gettext)
> 
> No translation at all, keep the English messages (as short as possible).

The internal translation support is about 2k. The gettext system is
roughly another 2k on top of that (and depends on the former).

I agree this is completely non-mandatory for "UTF-8 support" and
that's why musl originally didn't have it.

> >- Charset conversion (iconv)
> 
> Copy ASCII / UTF-8, but fail for all other.

iconv is big. About 128k. The ability to selectively omit some or all
legacy charsets from iconv is a long-term goal.

Of course if you have an actual need for character set conversion,
e.g. reading email in mutt, then your alternative to musl's 128k iconv
is GNU libiconv weighing in at several MB...

> >- Non-ASCII characters in regex and fnmatch patterns/brackets
> 
> May be the question to allow for UTF-8, but only those, no other
> charsets (should allow to do some optimization and avoid all the
> extended overhead).

That's how it is now.

> fnmatch: Match non-ASCII just 1:1, no other special operation.
> 
> regex: Don't have the experience on the internals of this topic. In
> general allow for 1:1 matching of non-ASCII characters, but
> otherwise behave as C locale (e.g. equivalence classes).

For both fnmatch and regex, the single-character-match (? or .
respectively) matches characters, not bytes. Likewise bracket
expressions match characters. In order for this to work at all, you
need UTF-8 decoding (see above).

There's no directly measurable code size cost for these items; any
savings from not doing UTF-8 would have to come from entirely
different code, which does not currently exist in musl, that bypasses
mbtowc and works directly on input bytes.
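
The difference between matching characters and matching bytes can be
sketched with a toy matcher that understands only literals and '?'.
This is not musl's fnmatch; toy_match and its by_char flag are
hypothetical, and the character mode assumes the string is valid
UTF-8.

```c
#include <stddef.h>

/* Length of the unit that '?' should consume at s (s must not be at
 * the terminator). Byte mode: always 1. Character mode: the lead byte
 * plus any UTF-8 continuation bytes (10xxxxxx) that follow it. */
static size_t step(const char *s, int by_char)
{
    size_t n = 1;
    if (by_char)
        while (((unsigned char)s[n] & 0xc0) == 0x80) n++;
    return n;
}

/* Hypothetical toy matcher: literal bytes and '?' only.
 * Returns 1 if str matches pat in full, 0 otherwise. */
int toy_match(const char *pat, const char *str, int by_char)
{
    while (*pat) {
        if (!*str) return 0;              /* pattern longer than string */
        if (*pat == '?') {
            str += step(str, by_char);    /* skip one character or one byte */
        } else {
            if (*pat != *str) return 0;   /* literal mismatch */
            str++;
        }
        pat++;
    }
    return *str == '\0';
}
```

In character mode the pattern "caf?" matches "caf" followed by the
two-byte sequence for U+00E9; in byte mode it does not, and "caf??" is
needed instead.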

So aside from iconv, the above seem to total around 19k, and at least
6k of that is mandatory if you want to be able to claim to support
UTF-8. So the topic at hand seems to be whether you can save <13k of
libc.so size by hacking out character handling/locale related features
that are non-essential to basic UTF-8 support...
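
Spelled out, the arithmetic behind those totals looks like this. The
figures are the rough sizes quoted above, in k; the identifiers are
made up for this sketch.

```c
/* Approximate sizes in k, as quoted in this message. */
enum {
    UTF8_CODEC_K = 3,   /* UTF-8 encoding and decoding */
    CHAR_PROPS_K = 11,  /* character properties, including wcwidth */
    WCWIDTH_K    = 3,   /* the wcwidth share of CHAR_PROPS_K */
    CASE_MAP_K   = 1,   /* case mappings */
    INTL_MSG_K   = 2,   /* internal message translation */
    GETTEXT_K    = 2    /* gettext on top of the internal support */
};

/* Everything listed except iconv. */
int total_k(void)
{
    return UTF8_CODEC_K + CHAR_PROPS_K + CASE_MAP_K + INTL_MSG_K + GETTEXT_K;
}

/* Minimum needed to claim UTF-8 support: codec plus wcwidth. */
int mandatory_k(void)
{
    return UTF8_CODEC_K + WCWIDTH_K;
}

/* Upper bound on what removing the rest could save. */
int max_savings_k(void)
{
    return total_k() - mandatory_k();
}
```

That is 19k in total, 6k mandatory, and at most 13k that could be
removed.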

Rich

