I'm not an expert in this, but I believe a pure Haskell solution would mean implementing the Unicode Collation Algorithm. The Unicode Common Locale Data Repository contains the per-locale settings that configure the algorithm to sort according to each locale's rules. This is what ICU does.

On Monday, March 22, 2021 at 6:56:04 AM UTC+1 John MacFarlane wrote:
"'Nick Bart' via pandoc-discuss"
<pandoc-...@googlegroups.com> writes:

> An unofficial fork of text-icu claims to have fixed the issue (https://github.com/WorldSEnder/text-icu/commit/7657227a7ca8ad13be86db5c190b806774a5fd6b).
>
> I wonder if anyone could indicate how to tweak the pandoc install command to include, for the time being, the WorldSEnder/text-icu fork rather than the official one - or whether there is anything else I could try to fix this issue on the pandoc side. (I tried downgrading icu4c via homebrew, but apparently no formulae for earlier versions are available.)

Replace stack.yaml with this:


``` stack.yaml
flags:
  pandoc:
    trypandoc: false
    embed_data_files: true
  QuickCheck:
    old-random: false
  citeproc:
    icu: true
packages:
- '.'
extra-deps:
- hslua-1.3.0
- hslua-module-path-0.1.0
- jira-wiki-markup-1.3.4
- skylighting-core-0.10.5
- skylighting-0.10.5
- doclayout-0.3.0.2
- citeproc-0.3.0.9
- texmath-0.12.2
- random-1.2.0
- git: https://github.com/WorldSEnder/text-icu
  commit: 7657227a7ca8ad13be86db5c190b806774a5fd6b
ghc-options:
  "$locals": -fhide-source-paths -Wno-missing-home-modules
resolver: lts-17.5
nix:
  packages: [zlib]
```

Then run stack install.

> As an aside, while I fully understand the wish not to have to include a huge external C library by default, I feel that pandoc’s default sorting algorithm, currently based on “i;unicode-casemap” (RFC 5051), is somewhat below par. In particular, it does not even comply with mainstream English-language rules as far as accented characters are concerned. The Chicago Manual of Style (17e, 2017, 16.67) unambiguously states: “Words beginning with or including accented letters are alphabetized as though they were unaccented.” One of their examples gives the sort order “Ubeda – Über – Ubina”. Without icu support, pandoc incorrectly sorts this as “Ubeda – Ubina – Über”.

Yes. I agree. Actually, if we just need special treatment for
English locales, then I don't think it should be too hard. We
can use the Haskell unicode-transforms library (already a
dependency of pandoc) to normalize the text and then remove
accents:

Prelude Data.Text.Normalize Data.Text Data.Char> Data.Text.filter (not . isMark) $ normalize NFD "dérégler"
"deregler"

We could sort on the result of that transform.
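
To make that concrete, here is a minimal sketch of sorting on the accent-stripped key. The names stripAccents and sortUnaccented are made up for illustration and are not part of pandoc or citeproc; the only library calls used are normalize and NFD from unicode-transforms, Data.Text.filter, isMark from Data.Char, and sortOn from Data.List. It does no case folding and applies no locale-specific rules.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Char (isMark)
import Data.List (sortOn)
import Data.Text (Text)
import qualified Data.Text as T
import Data.Text.Normalize (NormalizationMode (NFD), normalize)

-- | Hypothetical helper: decompose to NFD, then drop combining marks,
-- so that "dérégler" becomes "deregler".
stripAccents :: Text -> Text
stripAccents = T.filter (not . isMark) . normalize NFD

-- | Hypothetical helper: sort by the accent-stripped form only.
-- sortOn is stable, so entries whose keys are equal keep their input order.
sortUnaccented :: [Text] -> [Text]
sortUnaccented = sortOn stripAccents
```

Applied to the Chicago example, sortUnaccented ["Ubina", "Über", "Ubeda"] puts the entries in the order Ubeda, Über, Ubina, because the keys compared are "Ubeda", "Uber" and "Ubina".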

(This method would also affect non-Western scripts, though, and
I don't know what the rules around those are...)

For non-English locales, would we want to fall back to RFC 5051?

I'm not sure what all the relevant rules are; if it's not too
terribly complicated, I wonder if a pure Haskell library could
be cooked up. It's a shame that there's no way to do proper
Unicode collation in Haskell without the difficult icu4c
dependency.
