From: Alastair Houghton
Date: Thu, 10 Aug 2023 16:41:38 +0100
To: musl@lists.openwall.com
Cc: Rich Felker
Subject: [musl] setlocale() again

Hi again,

I spent some time today looking at the setlocale() problem and thought I’d put some notes down in an email.

1. Musl wishes to support UTF-8 “out of the box”.

2. At the same time, it needs to be 8-bit-safe, so the default locale, C, is NOT UTF-8.

3. POSIX, and the C standard, specify that setlocale() should fail if the locale name isn’t a valid locale, but don’t really say what that means precisely. A program that wants UTF-8 support and that does `setlocale(LC_ALL, "")` can therefore find itself in the C locale if the one specified in the environment happens to be invalid.

4. This seemed undesirable, so setlocale() presently accepts any locale name as valid; if it doesn’t have a definition file for a locale, it will copy the C.UTF-8 locale, give the copy the name passed in, and return that. This avoids the problem in (3), and also means that gettext() will work for any language without installing locale data for Musl.
Unfortunately it also means that there is no way for a program (notably a test suite) to determine the presence of data for a locale, because setlocale() will always succeed, even if we don’t have the data.

5. Back in 2017 (https://www.openwall.com/lists/musl/2017/11/08/2) Rich was proposing to change things so that `setlocale(cat, "")` always succeeds, but if the environment specifies an unknown locale, treats it as C.UTF-8, while `setlocale(cat, explicit_name)` will fail unless a valid definition file is installed for that locale name. This would also avoid the problem in (3), although it will mean that gettext() will not work unless a valid locale definition is installed for the C library (BTW, this is exactly the situation Glibc is in here; if Glibc doesn’t have locale data, it will fail setlocale() and then gettext() will find itself in the C locale). On the other hand, it does mean that programs can detect whether or not a given locale is present.

Why do I care? Because I’m trying to make libc++ work with Musl and right now it has failing tests because it expects (not entirely unreasonably) that if e.g. `setlocale(LC_ALL, "fr_FR")` succeeds, then the C library will localise things into French. While I can test for the unusual behaviour of Musl detailed in (4), the libc++ maintainer understandably doesn’t like it and we would both far rather Musl were fixed to behave similarly to other implementations.

It seems to me that Rich’s proposal (5) was sensible. Programs that use gettext(), and users relying on it for localization, must already cope with the fact that the C library must have locale data for their chosen locale in order for gettext() to work; that is how things work on Glibc.
It so happened that (4) meant that such programs would work with partial localization on Musl without there being any locale data installed for Musl, but that isn’t really right (e.g. you might get a mix of localized strings from gettext() but with numeric formatting that didn’t match - for French, for instance, numbers would have “.”s instead of “,”s as a decimal separator).

Looking at the 2017 thread, it appears it didn’t go anywhere for whatever reason, so I’d like to understand the status of the proposed change. Was it nixed for some reason? Is it likely to happen in the future? If it’s a matter of resource, if I were to raise a patch for it, would it be accepted, in principle?

Kind regards,

Alastair.