From: enh <enh@google.com>
To: musl@lists.openwall.com
Subject: Re: [musl] Selecting locale source format
Date: Wed, 17 Sep 2025 11:43:46 -0400
Message-ID: <CAJgzZooiidR18yF3jY0098_ugguiwB59dT2NXs4MYg8tfAF1BQ@mail.gmail.com>
In-Reply-To: <20250917011406.GA11978@brightrain.aerifal.cx>
On Tue, Sep 16, 2025 at 9:14 PM Rich Felker <dalias@libc.org> wrote:
>
> I have a proposed binary format for new locale files that I'm in the
> process of writing up, but Pablo brought it to my attention that,
> while binary format (ABI) is what's important to have down and stable
> at the time we integrate into musl, pinning down the source format is
> what's important/blocking for collaboration with localization folks.
>
> I have two candidate formats in the works right now for this:
>
>
>
> Option 1: subset+extension of POSIX localedef format.
>
> The basis for this format is described in
> https://pubs.opengroup.org/onlinepubs/9799919799/basedefs/V1_chap07.html
>
> If we go this way, it would be a "subset" because (1) some parts are
> not relevant, like LC_CTYPE, which does not vary by locale,
note that that's not true for 'i' in Turkish/Azeri locales, for
example: there, 'i' uppercases to U+0130 (İ) and 'I' lowercases to
U+0131 (ı). (unless you meant that you plan on using the Unicode CLDR
data directly here.)
see the "Language-Sensitive Mappings" section of Unicode's
SpecialCasing.txt for all the special cases.
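to make the 'i' problem concrete, here's a rough sketch of
language-sensitive case mapping. it's python rather than C for brevity,
the helper names (upper, lower, TURKIC) are made up for illustration,
and it only models the tr/az dotted/dotless-i rules with a directly
adjacent combining dot; the full conditions in SpecialCasing.txt are
more involved.

```python
# Illustrative only: names are hypothetical and only the tr/az rules for
# dotted/dotless i are modeled (simplified from SpecialCasing.txt).
TURKIC = {"tr", "az"}

def upper(s, lang=""):
    """Uppercase s, mapping i -> U+0130 first in Turkic locales."""
    if lang in TURKIC:
        s = s.replace("i", "\u0130")
    # The default mapping already sends dotless U+0131 to 'I'.
    return s.upper()

def lower(s, lang=""):
    """Lowercase s, applying the Turkic I rules before the default mapping."""
    if lang in TURKIC:
        s = (s.replace("I\u0307", "i")   # I + combining dot above -> i
              .replace("\u0130", "i")    # capital I with dot -> i
              .replace("I", "\u0131"))   # plain capital I -> dotless i
    return s.lower()
```

so upper("i") gives "I" in the default case but U+0130 for "tr", which
is the kind of per-locale variation LC_CTYPE would have to carry.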
> (2) some
> parts will necessarily be represented in different ways, like
> collation where we're using UCA rather than the POSIX form, and (3)
> the format just has a lot of gratuitous cruft like symbolic character
> names. It will also necessarily be extended because POSIX localedef
> has no way to represent translated error strings etc. - keys for them
> have to be added.
>
> Going this route would have the source data in a fairly compact and
> "well-known" (to certain audiences) form, but requires that the
> tooling to produce binary locale files be aware of how these fields
> translate to the data model for the binary form.
>
> A sample (should be roughly correct C/POSIX locale) is attached for
> reference.
>
>
>
>
> Option 2: human-readable/text representation of the binary form
>
> Describing this requires a basic intro to the binary form, which is a
> multi-level hierarchical table mapping a path of integer key values to
> a data blob. In text we can represent keys with symbolic constants,
> but they're just a way of writing the underlying numbers. For example
> the path strerror/0 leads to the "No error information" text,
> strerror/EACCES leads to the "Permission denied" text, etc. Here
> "strerror" just represents a number for the first-level path component
> where strerror strings are stored, subindexed by (the arch/generic
> versions of) the errno codes.
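for anyone following along, my reading of that description can be
sketched as nested integer-keyed tables. this is only a guess at the
model, not the actual format: the value 1 for "strerror" is a
placeholder (the thread gives no numbers), and SYMBOLS, TABLE, resolve
and lookup are hypothetical names. python is used just to keep it short.

```python
import errno

# Hypothetical numbering: the real first-level key values are not
# given in the thread, so 1 is a placeholder for "strerror".
SYMBOLS = {"strerror": 1, "EACCES": errno.EACCES}

# Each level maps an integer key to either a sub-table or a data blob.
TABLE = {
    SYMBOLS["strerror"]: {
        0: b"No error information",
        errno.EACCES: b"Permission denied",
    },
}

def resolve(path):
    """Turn a symbolic path like 'strerror/EACCES' into integer keys."""
    return tuple(
        int(part) if part.isdigit() else SYMBOLS[part]
        for part in path.split("/")
    )

def lookup(table, path):
    """Walk the table one integer key at a time and return the blob."""
    node = table
    for key in resolve(path):
        node = node[key]
    return node
```

lookup(TABLE, "strerror/EACCES") walks two levels and returns the
"Permission denied" blob; the symbolic names exist only in the text
form and disappear once the path is resolved to numbers.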
>
> Going this route mostly avoids the need for smarts in the tooling, and
> "has more flexibility" to encode things. But this also potentially
> makes the encoding seem more arbitrary to localization folks.
>
> Like in option 1, a sample (some hybrid between C/POSIX and a
> hypothetical US-English locale, whipped up quick by hand as an
> example) of one way this format could look is attached for reference.
> An obvious variant that might be friendlier/more-familiar to folks
> working with the data would be representing the same in json (which is
> easy).
>
>
>
>
> My leaning is towards option 1.
>