From: Pablo Correa Gomez <pabloyoyoista@postmarketos.org>
To: Rich Felker <dalias@libc.org>, "A. Wilcox" <AWilcox@Wilcox-Tech.com>
Cc: musl@lists.openwall.com
Subject: Re: [musl] Selecting locale source format
Date: Fri, 19 Sep 2025 16:06:12 +0200
Message-ID: <66590520b9fef551b6fa0f3b0b6beed579b413e4.camel@postmarketos.org>
In-Reply-To: <20250917013655.GU1827@brightrain.aerifal.cx>
On Tue, 16-09-2025 at 21:36 -0400, Rich Felker wrote:
> On Tue, Sep 16, 2025 at 08:23:09PM -0500, A. Wilcox wrote:
> > On Sep 16, 2025, at 20:14, Rich Felker <dalias@libc.org> wrote:
> > >
> > > I have a proposed binary format for new locale files that I'm in the
> > > process of writing up, but Pablo brought it to my attention that,
> > > while binary format (ABI) is what's important to have down and stable
> > > at the time we integrate into musl, pinning down the source format is
> > > what's important/blocking for collaboration with localization folks.
> > >
> > > I have two candidate formats in the works right now for this:
> > >
> > >
> > >
> > > Option 1: subset+extension of POSIX localedef format.
> > >
> > > The basis for this format is described in
> > > https://pubs.opengroup.org/onlinepubs/9799919799/basedefs/V1_chap07.html
> > >
> > > If we go this way, it would be a "subset" because (1) some parts are
> > > not relevant, like LC_CTYPE, which does not vary by locale, (2) some
> > > parts will necessarily be represented in different ways, like
> > > collation where we're using UCA rather than the POSIX form, and (3)
> > > the format just has a lot of gratuitous cruft like symbolic character
> > > names. It will also necessarily be extended because POSIX localedef
> > > has no way to represent translated error strings etc. - keys for them
> > > have to be added.
> > >
> > > Going this route would have the source data in a fairly compact and
> > > "well-known" (to certain audiences) form, but requires that the
> > > tooling to produce binary locale files be aware of how these fields
> > > translate to the data model for the binary form.
> > >
> > > A sample (should be roughly correct C/POSIX locale) is attached for
> > > reference.
> > >
> > >
> > >
> > >
> > > Option 2: human-readable/text representation of the binary form
> > >
> > > Describing this requires a basic intro to the binary form, which is a
> > > multi-level hierarchical table mapping a path of integer key values to
> > > a data blob. In text we can represent keys with symbolic constants,
> > > but they're just a way of writing the underlying numbers. For example
> > > the path strerror/0 leads to the "No error information" text,
> > > strerror/EACCES leads to the "Permission denied" text, etc. Here
> > > "strerror" just represents a number for the first-level path component
> > > where strerror strings are stored, subindexed by (the arch/generic
> > > versions of) the errno codes.
> > >
> > > Going this route mostly avoids the need for smarts in the tooling, and
> > > "has more flexibility" to encode things. But this also potentially
> > > makes the encoding seem more arbitrary to localization folks.
> > >
> > > Like in option 1, a sample (some hybrid between C/POSIX and a
> > > hypothetical US-English locale, whipped up quick by hand as an
> > > example) of one way this format could look is attached for reference.
> > > An obvious variant that might be friendlier/more-familiar to folks
> > > working with the data would be representing the same in json (which is
> > > easy).
> > >
> > >
> > >
> > >
> > > My leaning is towards option 1.
> > >
> > > <sample_posix_localedef.txt><sample_binary_as_text.txt>
> >
> > Hi Rich,
> >
> > Thanks for continuing the locale work - very happy to see it
> > progressing!
> >
> > I definitely prefer option 1 as well. This will allow an easy
> > migration path for people using other Unix or Unix-like systems
> > (Solaris, AIX, glibc Linux) where localedef is also used. It also
> > means there is also a large corpus of existing files we can use,
> > both for testing the tooling and for initial drafts at porting musl
> > to other locales.
> >
> > I think it is reasonable to extend the file to handle translations
> > for days of the week/months. Is there a reason the existing system
> > of gettext(3) can’t be used for strerror_l?
>
> The fundamental problem with the current system we have is gettext
> keying off of the English string. That was fatal for [AB]MON_5 "May",
> but it's also less than ideal for error messages. For example it's
> plausible we might use the same text for an errno code as for a regex
> or getaddrinfo error message, and then the keys would clash. And of
> course if the messages are changed at all, translation files get
> invalidated.
@A. Wilcox, in case you missed it, the decision to go for this kind of
representation was discussed in point 1 of
https://www.openwall.com/lists/musl/2025/06/02/2
Sorry that it ended up being a bit of a long email.

Best,
Pablo
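
To make the keying problem quoted above concrete, here is a minimal
sketch in C. It is not musl or gettext code: the names tr_by_msgid,
tr_by_key, ITEM_ABMON_5 and ITEM_MON_5 are hypothetical, and the Spanish
strings are sample data chosen only for illustration. It shows why a
catalog keyed by the English text cannot tell the abbreviated month
"May" apart from the full month "May", while an integer key can.

```c
#include <stddef.h>
#include <string.h>

/* Catalog keyed by the English string, in the style of gettext.
 * "May" is both the abbreviated and the full month name in English,
 * so only one translation can be stored for it. */
struct msgid_entry {
    const char *msgid;
    const char *translation;
};

static const struct msgid_entry by_msgid[] = {
    { "May", "mayo" },   /* sample data: the full form wins */
};

/* Linear search by English text; falls back to the input string. */
const char *tr_by_msgid(const char *msgid)
{
    for (size_t i = 0; i < sizeof by_msgid / sizeof *by_msgid; i++)
        if (!strcmp(by_msgid[i].msgid, msgid))
            return by_msgid[i].translation;
    return msgid;
}

/* Hypothetical integer keys: one per item, independent of the text. */
enum { ITEM_ABMON_5, ITEM_MON_5, ITEM_COUNT };

static const char *const by_key[ITEM_COUNT] = {
    [ITEM_ABMON_5] = "may",    /* sample data: abbreviated form */
    [ITEM_MON_5]   = "mayo",   /* sample data: full form */
};

/* Direct index by key; returns NULL for an unknown key. */
const char *tr_by_key(unsigned key)
{
    return key < ITEM_COUNT ? by_key[key] : NULL;
}
```

With the string-keyed catalog both callers pass "May" and necessarily
get the same answer; with integer keys the two items stay distinct, and
rewording the English text does not invalidate the translation.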
>
> I'll go over the proposed new binary format more when I finish writing
> it up, but on top of avoiding all these issues, it lets us get rid of
> all the repetitive linear-search-multistring operations in musl and
> replace them with efficient O(1) lookup regardless of whether a locale
> file or internal messages in libc are being used.
>
> Rich
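
The "path of integer keys to a data blob" model quoted above can be
sketched in a few lines of C. This is only an in-memory illustration
under assumptions, not the proposed binary format, which has not been
posted in this thread: KEY_STRERROR, struct level2 and path_lookup are
hypothetical names, and the first-level key number is made up. The two
messages are the ones given in the quoted example.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical first-level key numbers; the thread does not give them. */
enum { KEY_STRERROR = 0, KEY_COUNT };

/* One first-level entry: a table of blobs indexed by the second key. */
struct level2 {
    size_t count;               /* number of slots in the table */
    const char *const *slots;   /* NULL where no blob is defined */
};

/* strerror/<errno code>, subindexed by the errno value itself. */
static const char *const strerror_slots[] = {
    [0]      = "No error information",
    [EACCES] = "Permission denied",
};

static const struct level2 root[KEY_COUNT] = {
    [KEY_STRERROR] = {
        sizeof strerror_slots / sizeof *strerror_slots,
        strerror_slots,
    },
};

/* Resolve the path k1/k2 with two bounds checks and two array indexes,
 * so the cost does not depend on how many messages are stored. */
const char *path_lookup(unsigned k1, unsigned k2)
{
    if (k1 >= KEY_COUNT)
        return NULL;
    if (k2 >= root[k1].count)
        return NULL;
    return root[k1].slots[k2];
}
```

Here path_lookup(KEY_STRERROR, EACCES) plays the role of the path
strerror/EACCES: each path component is just a number used as an array
index, which is what makes the lookup constant-time compared with
scanning a packed multistring.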