From: Assaf Gordon
Newsgroups: gmane.linux.lib.musl.general
Subject: Re: Possible bug in setlocale upon invalid LC_ALL value
Date: Fri, 1 Apr 2016 22:46:25 -0400
Message-ID: <9292C698-FABF-4721-AFC6-221ABAAD14F5@gmail.com>
References: <4C4AEBC7-4344-4867-B8F6-F1A691F123E0@gmail.com> <20160402005858.GA21636@brightrain.aerifal.cx>
In-Reply-To: <20160402005858.GA21636@brightrain.aerifal.cx>
Reply-To: musl@lists.openwall.com
To: Rich Felker
Cc: musl@lists.openwall.com

Hello Rich,

thank you for the prompt and detailed response.

> On Apr 1, 2016, at 20:58, Rich Felker wrote:
>
> On Fri, Apr 01, 2016 at 08:47:01PM -0400, Assaf Gordon wrote:
>> I think I've encountered a problem in musl, where using setlocale with invalid locale name returns the invalid locale instead of a known locale.
>
> This is intentional. All locale names are valid under musl, and those
> which don't have any particular definition are just aliases for
> C.UTF-8.

I will suggest a minor fix to GNU coreutils to accommodate for this
current implementation.

> The alternative would be that UTF-8 support breaks whenever
> LC_* vars are set but locales are not installed/configured, which
> would pretty much _always_ be the case when running a static-linked
> standalone binary on a non-musl-based system (where LC_* are probably
> set to something the main host libc recognizes).
>
> One possibility if this behavior is problematic would be to only
> consider names without their own definitions as aliases for C.UTF-8
> when MUSL_LOCPATH is not set. However I think we'd need to see a
> strong motivation for doing that, since it seems like it would be
> worse behavior in some ways, especially when using LC_MESSAGES set to
> a language for which you don't have a locale installed.

I'm not enough of an expert on locales to argue one way or the other.
Naively, I would think that this is somewhat problematic, because a
well-behaved program (one that checks setlocale's return value for
errors) has no way to warn the user that he/she used an invalid locale.

Perhaps a work-around would be to handle it this way:
if an invalid (non-existing) locale is given in the LC_* env vars,
setlocale(LC_ALL,"") should return NULL (indicating an error), then all
other invocations of setlocale(LC_*,NULL) would return the "C.UTF-8"
indicator.
This would allow detecting the error, but not affect further
processing (if invalid locales are already an alias to C.UTF-8). This
seems to match other OSes/libcs, which return a fixed "C" in such cases.

The reason for such a check is that it is a common user mistake to
specify non-existing locales, then be confused by the seemingly
incorrect results. Allowing a program to detect incorrect locales is a
good mitigation.

I'll side-step the non-UTF-8 locales (which would be a problem in the
current musl auto-aliasing to UTF-8), and show one possible case where
silent aliasing leads to incorrect results.

Consider the following UTF-8 string:
    N Ñ O P Y Z Å Ä Ö
(which includes Spanish eñe and the last three letters in the
Swedish alphabet).

When sorting with locale-aware programs, different locales should give
different collation orders (e.g. es_ES.UTF-8 vs sv_FI.UTF-8).

To reproduce:

    A='\116\n\303\221\n\117\n\120\n\131\n\132\n\303\205\n\303\204\n\303\226\n'
    printf "$A" | LC_ALL=sv_FI.UTF-8 sort
    printf "$A" | LC_ALL=es_ES.UTF-8 sort

If a user has a typo in the locale name (e.g. sv_SV.UTF-8), there's no
way for a program to detect it, and the user will get unexpectedly
ordered results.

GNU coreutils' 'sort' program added a --debug option to help users
diagnose such issues.
On Linux with glibc, this will be the output:

    $ printf "$A" | LC_ALL=es_ES.UTF-8 sort --debug > /dev/null
    sort: using ‘es_ES.UTF-8’ sorting rules

    $ printf "$A" | LC_ALL=sv_FI.UTF-8 sort --debug > /dev/null
    sort: using ‘sv_FI.UTF-8’ sorting rules

    $ printf "$A" | LC_ALL=sv_SV.UTF-8 sort --debug > /dev/null
    sort: using simple byte comparison

    $ printf "$A" | LC_ALL=foobar sort --debug > /dev/null
    sort: using simple byte comparison

The last two messages ("simple byte comparison") are the hint that the
locale is invalid and that sort does not use it.

On Alpine (Linux + musl), there's no way to detect such a case:

    $ printf "$A" | LC_ALL=sv_FI.UTF-8 gsort --debug > /dev/null
    gsort: using ‘sv_FI.UTF-8’ sorting rules

    $ printf "$A" | LC_ALL=sv_SV.UTF-8 gsort --debug > /dev/null
    gsort: using ‘sv_SV.UTF-8’ sorting rules

    $ printf "$A" | LC_ALL=foobar gsort --debug > /dev/null
    gsort: using ‘foobar’ sorting rules

regards,
 - assaf