I'm not an expert in this, but I believe a pure Haskell solution would mean implementing the Unicode Collation Algorithm. The Unicode Common Locale Data Repository contains the per-locale settings that configure the algorithm to sort according to each locale's rules. This is what ICU does.
"'Nick Bart' via pandoc-discuss"
<pandoc-...@googlegroups.com> writes:
> An unofficial fork of text-icu claims to have fixed the issue (https://github.com/WorldSEnder/text-icu/commit/7657227a7ca8ad13be86db5c190b806774a5fd6b).
>
> I wonder if anyone could indicate how to tweak the pandoc install command to include, for the time being, the WorldSEnder/text-icu fork rather than the official one - or whether there is anything else I could try to fix this issue on the pandoc side. (I tried downgrading icu4c via homebrew, but apparently no formulae for earlier versions are available.)
Replace stack.yaml with this:
``` stack.yaml
flags:
  pandoc:
    trypandoc: false
    embed_data_files: true
  QuickCheck:
    old-random: false
  citeproc:
    icu: true
packages:
- '.'
extra-deps:
- hslua-1.3.0
- hslua-module-path-0.1.0
- jira-wiki-markup-1.3.4
- skylighting-core-0.10.5
- skylighting-0.10.5
- doclayout-0.3.0.2
- citeproc-0.3.0.9
- texmath-0.12.2
- random-1.2.0
- git: https://github.com/WorldSEnder/text-icu
  commit: 7657227a7ca8ad13be86db5c190b806774a5fd6b
ghc-options:
  "$locals": -fhide-source-paths -Wno-missing-home-modules
resolver: lts-17.5
nix:
  packages: [zlib]
```
Then run stack install.
> As an aside, while I fully understand the wish not to have to include a huge external C library by default, I feel that pandoc’s default sorting algorithm, currently based on “i;unicode-casemap” (RFC 5051), is somewhat below par. In particular, it does not even comply with mainstream English-language rules as far as accented characters are concerned. The Chicago Manual of Style (17e, 2017, 16.67) unambiguously states: “Words beginning with or including accented letters are alphabetized as though they were unaccented.” One of their examples gives the sort order “Ubeda – Über – Ubina”. Without icu support, pandoc incorrectly sorts this as “Ubeda – Ubina – Über”.
Yes. I agree. Actually, if we just need special treatment for
English locales, then I don't think it should be too hard. We
can use the Haskell unicode-transforms library (already a
dependency of pandoc) to normalize the text and then remove
accents:
Prelude Data.Text.Normalize Data.Text Data.Char> Data.Text.filter (not . isMark) $ normalize NFD "dérégler"
"deregler"
We could sort on the result of that transform.
(This method would also affect non-Western scripts, though, and
I don't know what the rules around those are...)
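Here is a minimal sketch of what sorting on that transform could look like.
It only illustrates the idea and is not what pandoc or citeproc currently
does. The names stripAccents and sortUnaccented are made up for this
example, and it does no case folding or locale handling.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import Data.Char (isMark)
import Data.List (sortOn)
import Data.Text (Text)
import qualified Data.Text as T
import Data.Text.Normalize (NormalizationMode (NFD), normalize)

-- Hypothetical helper: decompose to NFD, then drop combining marks,
-- so that "Über" becomes "Uber".
stripAccents :: Text -> Text
stripAccents = T.filter (not . isMark) . normalize NFD

-- Hypothetical helper: sort by the accent-stripped key while keeping
-- the original strings in the result. The comparison is still a plain
-- code-point comparison, so case is not folded here.
sortUnaccented :: [Text] -> [Text]
sortUnaccented = sortOn stripAccents
```

With the Chicago example, sortUnaccented ["Ubina", "Über", "Ubeda"] gives
Ubeda, Über, Ubina, because the keys compared are "Ubina", "Uber" and
"Ubeda".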
For non-English locales, would we want to fall back to RFC 5051?
I'm not sure what all the relevant rules are; if it's not too
terribly complicated, I wonder if a pure Haskell library could
be cooked up. It's a shame that there's no way to do proper
Unicode collation in Haskell without the difficult icu4c
dependency.