mailing list of musl libc
* [musl] High-level binary format for new locale files
@ 2025-11-20  3:30 Rich Felker
  2025-12-10 22:56 ` Pablo Correa Gomez
  2026-02-25 15:02 ` Rich Felker
  0 siblings, 2 replies; 4+ messages in thread
From: Rich Felker @ 2025-11-20  3:30 UTC (permalink / raw)
  To: musl

The following is a draft that I've had pending for a while now,
regarding the binary format to be mmapped/processed at runtime, which
I put aside for a while to focus on what the localization-work-facing
source format would look like. It still needs some polishing and
fleshing out that will happen alongside out-of-tree implementation of
code for processing the locale data, but I think it's useful to have
the high-level design written up in public where it can be discussed
and used as reference in the future. This is part of the locale
support overhaul project, funded by NLnet and the NGI Zero Core Fund.




On a high level, the format is a multi-level table mapping "paths" of
integer keys to (usually textual) data blobs. First, some motivating
principles:

An important goal is that the built-in C locale data (langinfo,
strerror family, etc.) should be able to be represented in the exact
same form as an external locale file. This makes it so that we don't
need to have two versions of all of the lookup code, one for the
existing internal data and another for processing locale files. It
also means that we can get rid of some of the inefficient
linear-search logic for the built-in data now, making both
non-localized and localized performance better.

This kind of linear search elimination was already done for strerror
by Timo Teräs in commit 8343334d7b. I've been building on the same
concept (multiple inclusion of a header file defining the data, with
different context each time to expand to different parts of the table)
so that we don't need to "pre-compile" the built-in C locale data to
binary blobs like the ctype data, iconv data, nfd decomposition data,
etc. but can instead let the preprocessor do the work and keep the
data itself in editable source form.
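As a rough illustration of the multiple-expansion idea, here is a minimal sketch. It uses a list macro rather than re-including a header, a substitution made only so the example fits in one file. The names DEMO_MSGS, demo_data, demo_off and demo_msg are hypothetical and do not come from musl, and treating offset 0 as "no entry" is an assumption of this sketch.

```c
#include <stddef.h>

/* Hypothetical message list; each X(key, string) names one entry. */
#define DEMO_MSGS(X) \
	X(0, "No error information") \
	X(1, "Operation not permitted") \
	X(3, "No such process")

/* Expansion 1: one char array member per message, sized to fit it. */
struct demo_data {
#define X(n, s) char m_##n[sizeof(s)];
	DEMO_MSGS(X)
#undef X
};

/* Expansion 2: the string contents themselves. */
static const struct demo_data demo_data = {
#define X(n, s) .m_##n = s,
	DEMO_MSGS(X)
#undef X
};

/* Expansion 3: key -> member offset plus one, so 0 marks a gap. */
static const unsigned short demo_off[] = {
#define X(n, s) [n] = offsetof(struct demo_data, m_##n) + 1,
	DEMO_MSGS(X)
#undef X
};

/* Constant-time lookup; unknown keys fall back to entry 0. */
static const char *demo_msg(unsigned key)
{
	unsigned o = key < sizeof demo_off / sizeof *demo_off ? demo_off[key] : 0;
	if (!o) o = demo_off[0];
	return (const char *)&demo_data + (o - 1);
}
```

With this table, demo_msg(3) yields "No such process", while the unlisted key 2 and any out-of-range key fall back to the catch-all entry. No search loop is involved, and the data stays in editable source form.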

It's also desirable that the same data format used for locale strings
(langinfo, strerror, etc.) also work for collation elements. This
doesn't entirely preclude having a single flat integer namespace of
keys (for example you could OR a code onto the upper bits of
codepoints to mean "collation element") but it does suggest against
it.

With the above in mind, the high-level design looks like this:

Lookups are to be performed according to a "path" of integer keys,
where each path component may traverse one or more table levels. For
example, if top-level index 1 is langinfo strings, 1/0x20000 leads to
the ABDAY_1 string. In general there is a property,

	lookup(root,mmm/nnn) = lookup(lookup(root,mmm),nnn)

so that something (like collation) needing to perform lots of lookups
can just find its starting point in the tree once, and perform each
subsequent lookup relative to that.
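A minimal sketch of that composition property, assuming a simplified pointer-based tree rather than the mmapped binary layout the format will actually use. struct node, lookup1, lookup and the tiny demo tree are all hypothetical names and data chosen for illustration.

```c
#include <stddef.h>

/* Hypothetical in-memory node: either a leaf blob or a table of children. */
struct node {
	const char *blob;              /* non-NULL for a leaf */
	const struct node *const *kid; /* children, or NULL for a leaf */
	size_t nkid;
};

/* Descend one path component; NULL if the key is absent. */
static const struct node *lookup1(const struct node *t, size_t key)
{
	if (!t || !t->kid || key >= t->nkid) return NULL;
	return t->kid[key];
}

/* Follow a whole path of n keys from the given starting node. */
static const struct node *lookup(const struct node *t, const size_t *path, size_t n)
{
	for (size_t i = 0; t && i < n; i++) t = lookup1(t, path[i]);
	return t;
}

/* Demo tree: root/1 is a subtable holding two leaf strings. */
static const struct node leaf0 = { "Sun", NULL, 0 };
static const struct node leaf1 = { "Mon", NULL, 0 };
static const struct node *const sub_kids[] = { &leaf0, &leaf1 };
static const struct node subtable = { NULL, sub_kids, 2 };
static const struct node *const root_kids[] = { NULL, &subtable };
static const struct node root = { NULL, root_kids, 2 };
```

Because lookup just iterates lookup1, resolving the path 1/1 from root gives the same node as resolving 1 once, keeping the result, and then resolving 1 relative to it. That is what lets a heavy user such as collation cache its starting node.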

Because key spread may be sparse (for example, the langinfo keys have
a category starting at bit 16 and an index within the category
starting at bit 0), individual "path components" can be represented as
multiple levels in the table structure, with base/shift defined by the
file. For example, the langinfo subtable will typically define a first
level with base 0x20000 (there are no category-0 or category-1 items)
and shift 16, and leaf levels for each category.
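One way to picture a base/shift level is the sketch below. It assumes the slot is computed by subtracting the base and then shifting, and it uses a first-level base of 0x20000 (the ABDAY_1 key from the example above) with shift 16. The struct level type and both helper functions are hypothetical; the real on-disk encoding is not specified here.

```c
#include <stdint.h>

/* Hypothetical description of one table level. */
struct level {
	uint32_t base;   /* smallest key covered by this level */
	unsigned shift;  /* number of low bits left for deeper levels (< 32) */
	uint32_t count;  /* number of slots at this level */
};

/* Map a key to a slot at this level, or -1 if it is out of range. */
static long level_slot(const struct level *lv, uint32_t key)
{
	if (key < lv->base) return -1;
	uint32_t slot = (key - lv->base) >> lv->shift;
	return slot < lv->count ? (long)slot : -1;
}

/* The part of the key left over for the next level down.
 * Only meaningful when level_slot() accepted the key. */
static uint32_t level_rest(const struct level *lv, uint32_t key)
{
	return (key - lv->base) & ((UINT32_C(1) << lv->shift) - 1);
}
```

With a level of { 0x20000, 16, 4 }, key 0x20003 lands in slot 0 with remainder 3, key 0x50001 lands in slot 3 with remainder 1, and keys below 0x20000 are rejected without any table space being spent on them.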

While it seems like we could just skip the ability of the data to
define its own table levels like this, and instead treat something
sparse like langinfo keys as 2 path components (using the above
example, 1/2/0 for ABDAY_1), the above goal of being able to use the
same data structure and table traversal code for collation elements means
we already want the flexibility to represent sparse tables. The
specifics of collation element representation in the table structure
will be fleshed out later and may inform tuning. I am in the process
of munging base collation data to measure how large resulting tables
will be and what adjustments if any might be needed to represent the
data and do so efficiently.



To demo simplified use of the table design and a potential specific
binary format to use, I have a draft version of the include files to
produce built-in C locale data described above. These need a little
polishing still, so I'll include them in a follow-up to come soon.



* Re: [musl] High-level binary format for new locale files
  2025-11-20  3:30 [musl] High-level binary format for new locale files Rich Felker
@ 2025-12-10 22:56 ` Pablo Correa Gomez
  2026-02-25 15:02 ` Rich Felker
  1 sibling, 0 replies; 4+ messages in thread
From: Pablo Correa Gomez @ 2025-12-10 22:56 UTC (permalink / raw)
  To: Rich Felker, musl

Thanks a lot for this work! It is very neat that the strerror
optimization already sets a precedent for this approach. To me it
means that the pattern is more widely applicable, and therefore easier
to understand for people looking at the code.

I wonder if we should archive the rationale laid out here somewhere
other than the mailing list archive. It might be very relevant for
anybody looking at the code for the first time to understand what is
going on and why. Newcomers generally also don't have an easy or fast
way to search the archives.

Best,
Pablo Correa Gomez



* Re: [musl] High-level binary format for new locale files
  2025-11-20  3:30 [musl] High-level binary format for new locale files Rich Felker
  2025-12-10 22:56 ` Pablo Correa Gomez
@ 2026-02-25 15:02 ` Rich Felker
  2026-03-02 13:11   ` Pablo Correa Gomez
  1 sibling, 1 reply; 4+ messages in thread
From: Rich Felker @ 2026-02-25 15:02 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 2144 bytes --]

On Wed, Nov 19, 2025 at 10:30:43PM -0500, Rich Felker wrote:
> This kind of linear search elimination was already done for strerror
> by Timo Teräs in commit 8343334d7b. I've been building on the same
> concept (multiple inclusion of a header file defining the data, with
> different context each time to expand to different parts of the table)
> so that we don't need to "pre-compile" the built-in C locale data to
> binary blobs like the ctype data, iconv data, nfd decomposition data,
> etc. but can instead let the preprocessor do the work and keep the
> data itself in editable source form.
> 
> It's also desirable that the same data format used for locale strings
> (langinfo, strerror, etc.) also work for collation elements. This
> doesn't entirely preclude having a single flat integer namespace of
> keys (for example you could OR a code onto the upper bits of
> codepoints to mean "collation element") but it does suggest against
> it.
> 
> [...]
> 
> To demo simplified use of the table design and a potential specific
> binary format to use, I have a draft version of the include files to
> produce built-in C locale data described above. These need a little
> polishing still, so I'll include them in a follow-up to come soon.

This was supposed to be posted a long time ago, but better late than
never. The naming and parametrization might be a little bit clunky and
I expect to rework it for actual inclusion in musl with the locale
work, but it demonstrates all the concepts.

The attached __strerror.h is just from current musl. strerror2.c does
not contain any lookup code; it's just the top-level file to
instantiate the table. Compiling it to an object file lets you examine
the emitted binary format. Compiling it with -E to see preprocessed
output is also informative.

The magic is in mdecl2.h and mdata2.h. These define the binary format
and how the source data passed to the M() macro translates into the
binary format. At the moment I don't have a presentable version
actually using the multi-level aspect of the table, which nl_langinfo
needs, and which needs to be there in a minimal form in strerror data.

Rich

[-- Attachment #2: __strerror.h --]
[-- Type: text/plain, Size: 4063 bytes --]

/* The first entry is a catch-all for codes not enumerated here.
 * This file is included multiple times to declare and define a structure
 * with these messages, and then to define a lookup table translating
 * error codes to offsets of corresponding fields in the structure. */

M(0,            "No error information")

M(EILSEQ,       "Illegal byte sequence")
M(EDOM,         "Domain error")
M(ERANGE,       "Result not representable")

M(ENOTTY,       "Not a tty")
M(EACCES,       "Permission denied")
M(EPERM,        "Operation not permitted")
M(ENOENT,       "No such file or directory")
M(ESRCH,        "No such process")
M(EEXIST,       "File exists")

M(EOVERFLOW,    "Value too large for data type")
M(ENOSPC,       "No space left on device")
M(ENOMEM,       "Out of memory")

M(EBUSY,        "Resource busy")
M(EINTR,        "Interrupted system call")
M(EAGAIN,       "Resource temporarily unavailable")
M(ESPIPE,       "Invalid seek")

M(EXDEV,        "Cross-device link")
M(EROFS,        "Read-only file system")
M(ENOTEMPTY,    "Directory not empty")

M(ECONNRESET,   "Connection reset by peer")
M(ETIMEDOUT,    "Operation timed out")
M(ECONNREFUSED, "Connection refused")
M(EHOSTDOWN,    "Host is down")
M(EHOSTUNREACH, "Host is unreachable")
M(EADDRINUSE,   "Address in use")

M(EPIPE,        "Broken pipe")
M(EIO,          "I/O error")
M(ENXIO,        "No such device or address")
M(ENOTBLK,      "Block device required")
M(ENODEV,       "No such device")
M(ENOTDIR,      "Not a directory")
M(EISDIR,       "Is a directory")
M(ETXTBSY,      "Text file busy")
M(ENOEXEC,      "Exec format error")

M(EINVAL,       "Invalid argument")

M(E2BIG,        "Argument list too long")
M(ELOOP,        "Symbolic link loop")
M(ENAMETOOLONG, "Filename too long")
M(ENFILE,       "Too many open files in system")
M(EMFILE,       "No file descriptors available")
M(EBADF,        "Bad file descriptor")
M(ECHILD,       "No child process")
M(EFAULT,       "Bad address")
M(EFBIG,        "File too large")
M(EMLINK,       "Too many links")
M(ENOLCK,       "No locks available")

M(EDEADLK,      "Resource deadlock would occur")
M(ENOTRECOVERABLE, "State not recoverable")
M(EOWNERDEAD,   "Previous owner died")
M(ECANCELED,    "Operation canceled")
M(ENOSYS,       "Function not implemented")
M(ENOMSG,       "No message of desired type")
M(EIDRM,        "Identifier removed")
M(ENOSTR,       "Device not a stream")
M(ENODATA,      "No data available")
M(ETIME,        "Device timeout")
M(ENOSR,        "Out of streams resources")
M(ENOLINK,      "Link has been severed")
M(EPROTO,       "Protocol error")
M(EBADMSG,      "Bad message")
M(EBADFD,       "File descriptor in bad state")
M(ENOTSOCK,     "Not a socket")
M(EDESTADDRREQ, "Destination address required")
M(EMSGSIZE,     "Message too large")
M(EPROTOTYPE,   "Protocol wrong type for socket")
M(ENOPROTOOPT,  "Protocol not available")
M(EPROTONOSUPPORT,"Protocol not supported")
M(ESOCKTNOSUPPORT,"Socket type not supported")
M(ENOTSUP,      "Not supported")
M(EPFNOSUPPORT, "Protocol family not supported")
M(EAFNOSUPPORT, "Address family not supported by protocol")
M(EADDRNOTAVAIL,"Address not available")
M(ENETDOWN,     "Network is down")
M(ENETUNREACH,  "Network unreachable")
M(ENETRESET,    "Connection reset by network")
M(ECONNABORTED, "Connection aborted")
M(ENOBUFS,      "No buffer space available")
M(EISCONN,      "Socket is connected")
M(ENOTCONN,     "Socket not connected")
M(ESHUTDOWN,    "Cannot send after socket shutdown")
M(EALREADY,     "Operation already in progress")
M(EINPROGRESS,  "Operation in progress")
M(ESTALE,       "Stale file handle")
M(EUCLEAN,      "Data consistency error")
M(ENAVAIL,      "Resource not available")
M(EREMOTEIO,    "Remote I/O error")
M(EDQUOT,       "Quota exceeded")
M(ENOMEDIUM,    "No medium found")
M(EMEDIUMTYPE,  "Wrong medium type")
M(EMULTIHOP,    "Multihop attempted")
M(ENOKEY,       "Required key not available")
M(EKEYEXPIRED,  "Key has expired")
M(EKEYREVOKED,  "Key has been revoked")
M(EKEYREJECTED, "Key was rejected by service")

[-- Attachment #3: sterror2.c --]
[-- Type: text/plain, Size: 192 bytes --]

#include <stddef.h>
#include <errno.h>

#define M_SOURCE "__strerror.h"
#define M_NAME c_strerror
#define M_BASE 0
#include "mdecl2.h"

struct c_strerror c_strerror = {
#include "mdata2.h"
};

[-- Attachment #4: mdata2.h --]
[-- Type: text/plain, Size: 575 bytes --]

#define MDATA_CONCAT_(a,b) a##b
#define MDATA_CONCAT(a,b) MDATA_CONCAT_(a,b)
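/* Placeholder big-endian encoders: the argument is discarded and zero
 * bytes are emitted for now. M_MASK is not defined in these attachments;
 * that is harmless only because BE16() drops its argument. */
#define BE32(x) 0,0,0,0
#define BE16(x) 0,0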

.header = {
	0,0,0,0,
	BE32(offsetof(struct M_NAME, data)),
	BE16(M_BASE & M_MASK),
	BE16(sizeof(MDATA_CONCAT(M_NAME,_offsets))/2-1),
},
.offsets = {
#define M(n, s) \
	[2*((n)-M_BASE)] = \
	(offsetof(struct MDATA_CONCAT(M_NAME,_data), m_##n)+1)/256,\
	(offsetof(struct MDATA_CONCAT(M_NAME,_data), m_##n)+1)%256,
#include M_SOURCE
#undef M
},
.data = {
#define M(n, s) .m_##n = s,
#include M_SOURCE
#undef M
},

#undef MDATA_CONCAT
#undef MDATA_CONCAT_

[-- Attachment #5: mdecl2.h --]
[-- Type: text/plain, Size: 526 bytes --]

#include <stddef.h>

#define MDECL_CONCAT_(a,b) a##b
#define MDECL_CONCAT(a,b) MDECL_CONCAT_(a,b)

struct MDECL_CONCAT(M_NAME,_data) {
#define M(n, s) char m_##n[sizeof(s)];
#include M_SOURCE
#undef M
};

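/* Size the offsets array from the highest key: the designated
 * initializers below stretch the compound literal to cover every slot,
 * and sizeof then yields two bytes per key. */
typedef unsigned char MDECL_CONCAT(M_NAME,_offsets)[sizeof (unsigned char []){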
#define M(n, s) [2*((n)-M_BASE)] = 0, 0,
#include M_SOURCE
#undef M
}];

struct M_NAME {
	unsigned char header[12];
	MDECL_CONCAT(M_NAME,_offsets) offsets;
	struct MDECL_CONCAT(M_NAME,_data) data;
};

#undef MDECL_CONCAT
#undef MDECL_CONCAT_
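
The attachments contain no lookup code, so the following is only a sketch of how a reader of the emitted layout might work. It relies on what mdata2.h visibly does: two offset bytes per key, high byte first, holding the member offset plus one. Interpreting a stored 0 as "no entry" is an assumption suggested by that +1, not something the thread states. The function table_get and the hand-built demo_offsets/demo_data arrays are hypothetical stand-ins for the generated table, and the header is ignored because its fields are still placeholders.

```c
#include <stddef.h>

/* Hand-built stand-in for the emitted table: two offset bytes per key
 * (big-endian, data offset plus one), followed by the string data. */
static const unsigned char demo_offsets[] = {
	0, 1,   /* key 0 -> data offset 0 */
	0, 0,   /* key 1 -> no entry (assumed meaning of 0) */
	0, 6,   /* key 2 -> data offset 5 */
};
static const char demo_data[] = "none\0busy";

/* Return the string for key, or NULL if the key has no entry. */
static const char *table_get(const unsigned char *offsets, size_t nkeys,
                             const char *data, size_t key)
{
	if (key >= nkeys) return NULL;
	unsigned o = offsets[2*key] * 256u + offsets[2*key + 1];
	return o ? data + (o - 1) : NULL;
}
```

A caller such as a strerror implementation would presumably substitute a catch-all message when table_get returns NULL; that policy is left out here.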


* Re: [musl] High-level binary format for new locale files
  2026-02-25 15:02 ` Rich Felker
@ 2026-03-02 13:11   ` Pablo Correa Gomez
  0 siblings, 0 replies; 4+ messages in thread
From: Pablo Correa Gomez @ 2026-03-02 13:11 UTC (permalink / raw)
  To: Rich Felker, musl

> This was supposed to be posted a long time ago, but better late than
> never. The naming and parametrization might be a little bit clunky and
> I expect to rework it for actual inclusion in musl with the locale
> work, but it demonstrates all the concepts.

Thanks for following up!

> The attached __strerror.h is just from current musl. strerror2.c does
> not contain any lookup code; it's just the top-level file to
> instantiate the table. Compiling it to an object file lets you examine
> the emitted binary format. Compiling it with -E to see preprocessed
> output is also informative.
> 
> The magic is in mdecl2.h and mdata2.h. These define the binary format
> and how the source data passed to the M() macro translates into the
> binary format. At the moment I don't have a presentable version
> actually using the multi-level aspect of the table, which nl_langinfo
> needs, and which needs to be there in a minimal form in strerror data.

I know this is not the final version, but I think it would be very helpful to
include some sort of documentation in those headers, or at least a link to the
discussion in this thread. It would be really nice for future readers of the
code to have access to the history.


Pablo

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/
