After spending a bit wondering why files like "elder1" and "Elder2"
end up at completely different spots in the file list on my
postmarketOS (=Alpine-based) system, I filed a ticket with the Nemo
file manager. Turns out Nemo just uses locale-dependent sorting, so
I spent an hour trying to set LC_COLLATE to fix this, until I
stumbled across the remark on musl's website that LC_COLLATE sorting
is simply not supported. So I seem to be stuck with this, which I
did not expect.

This to me seems kind of disastrous on a desktop system. I just fail
to see any average default user (who doesn't know ASCII in their
head) expecting "elder1" and "Elder2" to be miles apart in a sorted
listing even as a default US person, let alone in some other
language that may be expected to use a different sorting for
whatever reason. (This affects umlauts too, I assume? So that'd be
most European languages having file lists entirely messed up, too.)
The sorting shouldn't be stuck as something that just makes sense to
programmers and balks at any special vowels, and it appears at least
as of now there is just no way to fix this.

Should desktop file managers like Nemo not be using this sorting
function? Or is musl not intended for desktop use, and postmarketOS
should switch? Otherwise, it seems like this omission in musl seems
like kind of a big deal. Or is it really just me who is constantly
confused as to where any file is at in any file lists...?

Or in other words, would be kind of cool if this could be changed
On Fri, Jan 28, 2022 at 02:41:38PM +0100, ellie wrote:
> After spending a bit wondering why files like "elder1" and "Elder2"
> end up at completely different spots in the file list on my
> postmarketOS (=Alpine-based) system, I filed a ticket with the Nemo
> file manager. Turns out Nemo just uses locale-dependent sorting, so
> I spent an hour trying to set LC_COLLATE to fix this, until I
> stumbled across the remark on musl's website that LC_COLLATE sorting
> is simply not supported. So I seem to be stuck with this, which I
> did not expect.
>
> This to me seems kind of disastrous on a desktop system. I just fail
> to see any average default user (who doesn't know ASCII in their
> head) expecting "elder1" and "Elder2" to be miles apart in a sorted
> listing even as a default US person, let alone in some other
> language that may be expected to use a different sorting for
> whatever reason. (This affects umlauts too, I assume? So that'd be
> most European languages having file lists entirely messed up, too.)
> The sorting shouldn't be stuck as something that just makes sense to
> programmers and balks at any special vowels, and it appears at least
> as of now there is just no way to fix this.
>
> Should desktop file managers like Nemo not be using this sorting
> function? Or is musl not intended for desktop use, and postmarketOS
> should switch? Otherwise, it seems like this omission in musl seems
> like kind of a big deal. Or is it really just me who is constantly
> confused as to where any file is at in any file lists...?
>
> Or in other words, would be kind of cool if this could be changed
LC_COLLATE functionality is just not designed or implemented yet, due
to lack of interest/participation from folks who want it to happen. I
very much do want it to happen, but I don't want to design something
(data model for efficient collation tables & code to use them) only to
have it turn out not to meet everyone's/anyone's needs because there
was nobody to bounce questions/testing/what-if's off during the
design.
A big part of this is probably that, historically, *nix users tend to
be happy with (or even prefer, which they can explicitly set via
exporting LC_COLLATE=C) codepoint-order sorting of directory entries,
like Makefile and README appearing at the top. So to get these folks
to care you have to have another setting where collation order
matters.
I'm happy to restart the process for getting this done if ppl are
interested.
Rich
I don't think nowadays the majority of users should be expected to be
traditional *nix users with terminal knowledge anymore. And most modern
desktop distros don't default to such a sorting as far as I can tell,
and instead to en_US or alike - but all those which use musl are left
stranded with "C" sorting. The type of users who are hit most by this
are not going to be the type who know what a terminal is, what musl is,
or how to voice their opinion on LC_COLLATE because their file manager
looks so weird. So if you want them to show up here that probably won't
happen. Beyond myself, I suppose.
I think for a typical user-friendly desktop the need is kinda clear, so
I'm not sure what other sort of setting would need to be introduced
still. If musl is meant to be used on desktop distros, this just seems
kind of mandatory, or I'm not really getting why it wouldn't be.
My apologies however if I'm misunderstanding, but that was basically
your question/what you're saying is delaying it, right? Sorry if you
didn't want further input from me on this, I hope I read your e-mail right
On 1/28/22 3:10 PM, Rich Felker wrote:
> On Fri, Jan 28, 2022 at 02:41:38PM +0100, ellie wrote:
>> After spending a bit wondering why files like "elder1" and "Elder2"
>> end up at completely different spots in the file list on my
>> postmarketOS (=Alpine-based) system, I filed a ticket with the Nemo
>> file manager. Turns out Nemo just uses locale-dependent sorting, so
>> I spent an hour trying to set LC_COLLATE to fix this, until I
>> stumbled across the remark on musl's website that LC_COLLATE sorting
>> is simply not supported. So I seem to be stuck with this, which I
>> did not expect.
>>
>> This to me seems kind of disastrous on a desktop system. I just fail
>> to see any average default user (who doesn't know ASCII in their
>> head) expecting "elder1" and "Elder2" to be miles apart in a sorted
>> listing even as a default US person, let alone in some other
>> language that may be expected to use a different sorting for
>> whatever reason. (This affects umlauts too, I assume? So that'd be
>> most European languages having file lists entirely messed up, too.)
>> The sorting shouldn't be stuck as something that just makes sense to
>> programmers and balks at any special vowels, and it appears at least
>> as of now there is just no way to fix this.
>>
>> Should desktop file managers like Nemo not be using this sorting
>> function? Or is musl not intended for desktop use, and postmarketOS
>> should switch? Otherwise, it seems like this omission in musl seems
>> like kind of a big deal. Or is it really just me who is constantly
>> confused as to where any file is at in any file lists...?
>>
>> Or in other words, would be kind of cool if this could be changed
>
> LC_COLLATE functionality is just not designed or implemented yet, due
> to lack of interest/participation from folks who want it to happen. I
> very much do want it to happen, but I don't want to design something
> (data model for efficient collation tables & code to use them) only to
> have it turn out not to meet everyone's/anyone's needs because there
> was nobody to bounce questions/testing/what-if's off during the
> design.
>
> A big part of this is probably that, historically, *nix users tend to
> be happy with (or even prefer, which they can explicitly set via
> exporting LC_COLLATE=C) codepoint-order sorting of directory entries,
> like Makefile and README appearing at the top. So to get these folks
> to care you have to have another setting where collation order
> matters.
>
> I'm happy to restart the process for getting this done if ppl are
> interested.
>
> Rich
(Android's libc maintainer here...)

i'd argue this isn't a musl bug. on Android we make a clear distinction between:

1. libc's responsibilities which, to paraphrase rich, are basically
"be unsurprising because your audience is OS/app developers who don't
speak all the languages their users use anyway". that is: "code point
order".

2. icu's responsibilities which cover all the user-facing (as opposed
to developer-facing) stuff. i18n is *hard* and the C/POSIX APIs are,
to be blunt, not fit for *that* purpose. there's a reason why all of
Android/macOS/Windows (and all the browsers) ship copies of icu.

the bug here is that a desktop file manager is assuming "i just want
telephone book order --- how hard can it be?". the answer turns out to
be "hard". especially when you get into fun stuff like users who *do*
speak multiple languages and have strong expectations for how they
sort. or places where there are multiple sort orders in common use.

you don't even need to be in very "exotic" languages to start hitting
these things. German and Spanish will do fine. see
https://unicode-org.github.io/icu/userguide/collation/ for a handful
of specific examples.

(as the maintainer of Android's Java i18n stuff before i ended up
owning bionic, you'd be surprised at the extent to which even Java --
which tried pretty hard by 1990s standards -- doesn't really cover
everything you need, not even for languages like Russian. so i don't
think C/POSIX could have done a great job in the 1990s, and one of
icu's main benefits is that it's been able to evolve to better support
existing languages/support more languages rather than being ossified
by an insufficient standard.)

"if you care about your users, you need icu/CLDR" is the easy side of
the argument. the flip side -- that libc *shouldn't* get involved --
is trickier. what convinced me was the amount of *breakage* you cause
if you try to be "good guy greg"... it turns out no-one wants dotless
i breaking their build just because their locale is a turkish/azeri
locale, for example. (dotted/dotless i is by far the most common
real-world issue i've seen.) but it's that kind of "text manipulation
tool used during builds" that are most likely to use libc
functionality, and although, sure, we can chase *everyone* making sure
they set their locale to "C" when building ... are we helping at that
point, or just making more work for everyone? (without actually
solving the real problem for the folks who just want to use their file
browser.)

On Fri, Jan 28, 2022 at 7:06 AM ellie <el@horse64.org> wrote:
>
> I don't think nowadays the majority of users should be expected to be
> traditional *nix users with terminal knowledge anymore. And most modern
> desktop distros don't default to such a sorting as far as I can tell,
> and instead to en_US or alike - but all those which use musl are left
> stranded with "C" sorting. The type of users who are hit most by this
> are not going to be the type who know what a terminal is, what musl is,
> or how to voice their opinion on LC_COLLATE because their file manager
> looks so weird. So if you want them to show up here that probably won't
> happen. Beyond myself, I suppose.
>
> I think for a typical user-friendly desktop the need is kinda clear, so
> I'm not sure what other sort of setting would need to be introduced
> still. If musl is meant to be used on desktop distros, this just seems
> kind of mandatory, or I'm not really getting why it wouldn't be.
>
> My apologies however if I'm misunderstanding, but that was basically
> your question/what you're saying is delaying it, right? Sorry if you
> didn't want further input from me on this, I hope I read your e-mail right
Hi,
On Fri, 28 Jan 2022, Rich Felker wrote:
> On Fri, Jan 28, 2022 at 02:41:38PM +0100, ellie wrote:
>> After spending a bit wondering why files like "elder1" and "Elder2"
>> end up at completely different spots in the file list on my
>> postmarketOS (=Alpine-based) system, I filed a ticket with the Nemo
>> file manager. Turns out Nemo just uses locale-dependent sorting, so
>> I spent an hour trying to set LC_COLLATE to fix this, until I
>> stumbled across the remark on musl's website that LC_COLLATE sorting
>> is simply not supported. So I seem to be stuck with this, which I
>> did not expect.
>>
>> This to me seems kind of disastrous on a desktop system. I just fail
>> to see any average default user (who doesn't know ASCII in their
>> head) expecting "elder1" and "Elder2" to be miles apart in a sorted
>> listing even as a default US person, let alone in some other
>> language that may be expected to use a different sorting for
>> whatever reason. (This affects umlauts too, I assume? So that'd be
>> most European languages having file lists entirely messed up, too.)
>> The sorting shouldn't be stuck as something that just makes sense to
>> programmers and balks at any special vowels, and it appears at least
>> as of now there is just no way to fix this.
>>
>> Should desktop file managers like Nemo not be using this sorting
>> function? Or is musl not intended for desktop use, and postmarketOS
>> should switch? Otherwise, it seems like this omission in musl seems
>> like kind of a big deal. Or is it really just me who is constantly
>> confused as to where any file is at in any file lists...?
>>
>> Or in other words, would be kind of cool if this could be changed
>
> LC_COLLATE functionality is just not designed or implemented yet, due
> to lack of interest/participation from folks who want it to happen. I
> very much do want it to happen, but I don't want to design something
> (data model for efficient collation tables & code to use them) only to
> have it turn out not to meet everyone's/anyone's needs because there
> was nobody to bounce questions/testing/what-if's off during the
> design.
>
> A big part of this is probably that, historically, *nix users tend to
> be happy with (or even prefer, which they can explicitly set via
> exporting LC_COLLATE=C) codepoint-order sorting of directory entries,
> like Makefile and README appearing at the top. So to get these folks
> to care you have to have another setting where collation order
> matters.
A case-study might be PostgreSQL, but I believe we solved collation there
by using the ICU library instead.
Ariadne
On Fri, Jan 28, 2022 at 08:58:30AM -0800, enh wrote:
> (Android's libc maintainer here...)
>
> i'd argue this isn't a musl bug. on Android we make a clear distinction between:
>
> 1. libc's responsibilities which, to paraphrase rich, are basically
> "be unsurprising because your audience is OS/app developers who don't
> speak all the languages their users use anyway". that is: "code point
> order".

That's not what I said. I speculated that part of the difficulty with
getting people to care is that a large number of users personally
prefer LC_COLLATE=C. Not that we should punt because of that.

> 2. icu's responsibilities which cover all the user-facing (as opposed
> to developer-facing) stuff. i18n is *hard* and the C/POSIX APIs are,
> to be blunt, not fit for *that* purpose. there's a reason why all of
> Android/macOS/Windows (and all the browsers) ship copies of icu.

ICU is really, *really* bad. I don't want to be encouraging people to
use it because basic functionality is missing from libc.

> the bug here is that a desktop file manager is assuming "i just want
> telephone book order --- how hard can it be?". the answer turns out to
> be "hard". especially when you get into fun stuff like users who *do*
> speak multiple languages and have strong expectations for how they
> sort. or places where there are multiple sort orders in common use.

Absolutely. That's why I don't want to treat the problem half-assedly,
but make sure we design or choose a format for the collation tables
that's simultaneously (1) efficient, (2) sufficiently expressive to
give the behaviors users may want, and (3) easy enough to understand
that users can customize it if needed. The POSIX localedef format (an
option group musl intentionally does not support) does not have any of
those properties except maybe #2. The standard Unicode format may
translate directly into something that can meet all 3; I'm not sure.

Rich
Hi,
On Fri, 28 Jan 2022, ellie wrote:
> I don't think nowadays the majority of users should be expected to be
> traditional *nix users with terminal knowledge anymore. And most modern
> desktop distros don't default to such a sorting as far as I can tell, and
> instead to en_US or alike - but all those which use musl are left stranded
> with "C" sorting. The type of users who are hit most by this are not going to
> be the type who know what a terminal is, what musl is, or how to voice their
> opinion on LC_COLLATE because their file manager looks so weird. So if you
> want them to show up here that probably won't happen. Beyond myself, I
> suppose.
>
> I think for a typical user-friendly desktop the need is kinda clear, so I'm
> not sure what other sort of setting would need to be introduced still. If
> musl is meant to be used on desktop distros, this just seems kind of
> mandatory, or I'm not really getting why it wouldn't be.
>
> My apologies however if I'm misunderstanding, but that was basically your
> question/what you're saying is delaying it, right? Sorry if you didn't want
> further input from me on this, I hope I read your e-mail right
LC_COLLATE is a desired feature in musl, but getting it right is going to
take some work. We should want to be careful about it because we want to
avoid having giant tables, or some plug-in architecture like GLIBC has,
which was recently at the center of the pwnkit debacle.
Ariadne
On Fri, Jan 28, 2022 at 10:01 AM Rich Felker <dalias@libc.org> wrote:
>
> On Fri, Jan 28, 2022 at 08:58:30AM -0800, enh wrote:
> > (Android's libc maintainer here...)
> >
> > i'd argue this isn't a musl bug. on Android we make a clear distinction between:
> >
> > 1. libc's responsibilities which, to paraphrase rich, are basically
> > "be unsurprising because your audience is OS/app developers who don't
> > speak all the languages their users use anyway". that is: "code point
> > order".
>
> That's not what I said. I speculated that part of the difficulty with
> getting people to care is that a large number of users personally
> prefer LC_COLLATE=C. Not that we should punt because of that.
>
> > 2. icu's responsibilities which cover all the user-facing (as opposed
> > to developer-facing) stuff. i18n is *hard* and the C/POSIX APIs are,
> > to be blunt, not fit for *that* purpose. there's a reason why all of
> > Android/macOS/Windows (and all the browsers) ship copies of icu.
>
> ICU is really, *really* bad. I don't want to be encouraging people to
> use it because basic functionality is missing from libc.

human languages are really really messy. a lot of the complexity is inherent.

as for the non-inherent, https://github.com/unicode-org/icu4x seems
like a good start.

> > the bug here is that a desktop file manager is assuming "i just want
> > telephone book order --- how hard can it be?". the answer turns out to
> > be "hard". especially when you get into fun stuff like users who *do*
> > speak multiple languages and have strong expectations for how they
> > sort. or places where there are multiple sort orders in common use.
>
> Absolutely. That's why I don't want to treat the problem half-assedly,

but that's my point --- it's not the *implementation* that's the
issue, it's that the C/POSIX *interfaces* are insufficient. the bar on
how good a job you _can_ do within those constraints is horribly low.

> but make sure we design or choose a format for the collation tables
> that's simultaneously (1) efficient, (2) sufficiently expressive to
> give the behaviors users may want, and (3) easy enough to understand
> that users can customize it if needed. The POSIX localedef format (an
> option group musl intentionally does not support) does not have any of
> those properties except maybe #2. The standard Unicode format may
> translate directly into something that can meet all 3; I'm not sure.
>
> Rich
On Fri, Jan 28, 2022 at 10:33:53AM -0800, enh wrote:
> On Fri, Jan 28, 2022 at 10:01 AM Rich Felker <dalias@libc.org> wrote:
> >
> > On Fri, Jan 28, 2022 at 08:58:30AM -0800, enh wrote:
> > > (Android's libc maintainer here...)
> > >
> > > i'd argue this isn't a musl bug. on Android we make a clear distinction between:
> > >
> > > 1. libc's responsibilities which, to paraphrase rich, are basically
> > > "be unsurprising because your audience is OS/app developers who don't
> > > speak all the languages their users use anyway". that is: "code point
> > > order".
> >
> > That's not what I said. I speculated that part of the difficulty with
> > getting people to care is that a large number of users personally
> > prefer LC_COLLATE=C. Not that we should punt because of that.
> >
> > > 2. icu's responsibilities which cover all the user-facing (as opposed
> > > to developer-facing) stuff. i18n is *hard* and the C/POSIX APIs are,
> > > to be blunt, not fit for *that* purpose. there's a reason why all of
> > > Android/macOS/Windows (and all the browsers) ship copies of icu.
> >
> > ICU is really, *really* bad. I don't want to be encouraging people to
> > use it because basic functionality is missing from libc.
>
> human languages are really really messy. a lot of the complexity is inherent.
>
> as for the non-inherent, https://github.com/unicode-org/icu4x seems
> like a good start.

The problems with ICU are all software engineering problems not
problem-domain complexity problems. Bad resource-hungry choices with
poor safety properties all over.

> > > the bug here is that a desktop file manager is assuming "i just want
> > > telephone book order --- how hard can it be?". the answer turns out to
> > > be "hard". especially when you get into fun stuff like users who *do*
> > > speak multiple languages and have strong expectations for how they
> > > sort. or places where there are multiple sort orders in common use.
> >
> > Absolutely. That's why I don't want to treat the problem half-assedly,
>
> but that's my point --- it's not the *implementation* that's the
> issue, it's that the C/POSIX *interfaces* are insufficient. the bar on
> how good a job you _can_ do within those constraints is horribly low.

I'm not sure what you mean by "the interfaces are insufficient" here.
They're insufficient to do things they weren't meant to do (e.g. deal
with data with multiple cultural conventions where the data has to be
tagged with which conventions apply to it), but giving listings in a
user's chosen collation order convention is something they're
perfectly capable of doing.

Most applications do not want to deal with (and do not even have the
necessary metadata to deal with, since the raw data is plain text) the
sort of mix the standard interfaces can't handle. They just want to
give decent, culturally-non-surprising UX. Applications that do want
to go beyond this can of course use the full Unicode data (via ICU or
ideally a better alternative).

Rich
On Fri, Jan 28, 2022 at 01:01:04PM -0500, Rich Felker wrote:
> ICU is really, *really* bad. I don't want to be encouraging people to
> use it because basic functionality is missing from libc.
>
But basic functionality *is* missing from libc, and by design. By the
standard. For example, toupper and towupper can only return a single
code point. That doesn't work with German's ß character, which has the
capital form SS. If you were transforming some general German word group
into block capitals for a headline or something, that is the
transformation you would use. Now, some people have invented a capital
version of ß, that is still new enough to make blocks appear in many
programs (test your mail program here: ẞ), but that letter is not widely
used.
Also, many applications expect towupper and towlower to be inverse
functions of each other, but here, not all instances of SS ought to be
transformed to ß when passing them through towlower, even if the
interface did support such a thing.
My point is that the development of interfaces that deal with
internationalization might be better put into a library with an
interface less rigid than libc, where any adjustment moves at the
glacial pace of the Austin Group or WG14, and in any case, breaking
changes are completely out of the question. That is also why we still
have gets() and strchr().
Whether ICU is a suitable library for that purpose I lack the expertise
to say. However, all I have heard about it so far is either that one
should use it to cure all i18n ills, or that it is an abomination unto
the Lord. But even the people in the second camp fail to recommend a
superior alternative. So I'm guessing there isn't one.
As to the actual function in question: Simply having a possibility to
switch strcoll to be the same as strcasecmp instead of strcmp would
probably already be the 80% solution for most European languages.
Yeah, it won't work with umlauts, but we Germans are used to that. "It
is <current year> and we still can't do umlauts" is a common curse
levelled at information technology, and for the most part it is apt. I
routinely counsel against using umlauts in file names or pass phrases,
because you never know what character set it gets saved in or
transmitted later, and it just causes avoidable problems. I really doubt
this issue will ever be solved within my lifetime.
JM2C,
Markus