From: enh
Date: Wed, 17 Sep 2025 11:43:46 -0400
To: musl@lists.openwall.com
Subject: Re: [musl] Selecting locale source format
References: <20250917011406.GA11978@brightrain.aerifal.cx>
In-Reply-To: <20250917011406.GA11978@brightrain.aerifal.cx>

On Tue, Sep 16, 2025 at 9:14 PM Rich Felker wrote:
>
> I have a proposed binary format for new locale files that I'm in the
> process of writing up, but Pablo brought it to my attention that,
> while binary format (ABI) is what's important to have down and stable
> at the time we integrate into musl, pinning down the source format is
> what's important/blocking for collaboration with localization folks.
>
> I have two candidate formats in the works right now for this:
>
>
>
> Option 1: subset+extension of POSIX localedef format.
>
> The basis for this format is described in
> https://pubs.opengroup.org/onlinepubs/9799919799/basedefs/V1_chap07.html
>
> If we go this way, it would be a "subset" because (1) some parts are
> not relevant, like LC_CTYPE, which does not vary by locale,

note that that's not true for 'i' in turkish/azeri locales, for
example. (unless you meant that you plan on using the unicode cldr
data directly here.) see the "Language-Sensitive Mappings" section of
SpecialCasing.txt for all the special cases.

> (2) some
> parts will necessarily be represented in different ways, like
> collation where we're using UCA rather than the POSIX form, and (3)
> the format just has a lot of gratuitous cruft like symbolic character
> names. It will also necessarily be extended because POSIX localedef
> has no way to represent translated error strings etc. - keys for them
> have to be added.
>
> Going this route would have the source data in a fairly compact and
> "well-known" (to certain audiences) form, but requires that the
> tooling to produce binary locale files be aware of how these fields
> translate to the data model for the binary form.
>
> A sample (should be roughly correct C/POSIX locale) is attached for
> reference.
>
>
>
>
> Option 2: human-readable/text representation of the binary form
>
> Describing this requires a basic intro to the binary form, which is a
> multi-level hierarchical table mapping a path of integer key values to
> a data blob. In text we can represent keys with symbolic constants,
> but they're just a way of writing the underlying numbers. For example
> the path strerror/0 leads to the "No error information" text,
> strerror/EACCES leads to the "Permission denied" text, etc.
> Here "strerror" just represents a number for the first-level path
> component where strerror strings are stored, subindexed by (the
> arch/generic versions of) the errno codes.
>
> Going this route mostly avoids the need for smarts in the tooling, and
> "has more flexibility" to encode things. But this also potentially
> makes the encoding seem more arbitrary to localization folks.
>
> Like in option 1, a sample (some hybrid between C/POSIX and a
> hypothetical US-English locale, whipped up quick by hand as an
> example) of one way this format could look is attached for reference.
> An obvious variant that might be friendlier/more-familiar to folks
> working with the data would be representing the same in json (which is
> easy).
>
>
>
>
> My leaning is towards option 1.
>
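
For readers unfamiliar with the data model quoted under option 2, here is a minimal sketch of a key-path lookup: a path of integer keys is walked through nested tables until it reaches a data blob. Everything in it is an assumption made for illustration. The struct layout, the KEY_STRERROR value, and the lookup() helper are hypothetical and are not taken from the proposed binary format, which the thread does not spell out. Only the two strings and the strerror/0 and strerror/EACCES paths come from the quoted text.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical number for the first-level "strerror" path component.
 * The real numbering is not given in the thread. */
enum { KEY_STRERROR = 1 };

/* One table entry: either a leaf holding a blob or a node holding a
 * child table. This in-memory layout is an illustration only. */
struct entry {
    int key;                  /* integer path component */
    const char *blob;         /* data at a leaf, NULL for inner nodes */
    const struct entry *sub;  /* child table, NULL for leaves */
    size_t nsub;              /* number of entries in the child table */
};

static const struct entry strerror_tab[] = {
    { 0,      "No error information", NULL, 0 },
    { EACCES, "Permission denied",    NULL, 0 },
};

static const struct entry root[] = {
    { KEY_STRERROR, NULL, strerror_tab, 2 },
};

/* Walk `depth` integer keys starting at table `tab` with `n` entries.
 * Returns the blob at the end of the path, or NULL if any component
 * is missing or the path ends on an inner node. */
static const char *lookup(const struct entry *tab, size_t n,
                          const int *path, size_t depth)
{
    assert(path != NULL && depth > 0);
    for (size_t level = 0; level < depth; level++) {
        const struct entry *hit = NULL;
        for (size_t i = 0; i < n; i++) {
            if (tab[i].key == path[level]) {
                hit = &tab[i];
                break;
            }
        }
        if (hit == NULL)
            return NULL;          /* no such key at this level */
        if (level + 1 == depth)
            return hit->blob;     /* end of path: leaf data, if any */
        if (hit->sub == NULL)
            return NULL;          /* path continues past a leaf */
        tab = hit->sub;
        n = hit->nsub;
    }
    return NULL;
}
```

With these tables, the path { KEY_STRERROR, EACCES } resolves to the "Permission denied" entry and a path with an unknown second component resolves to NULL. The symbolic names in the text form would then be just spellings of the integers in such a path.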