I'm not an expert in this, but I believe a pure Haskell solution means
implementing the Unicode Collation Algorithm. The Unicode Common Locale
Data Repository contains the per-locale settings that configure the
algorithm to sort according to the locale's rules. This is what ICU does.

On Monday, March 22, 2021 at 6:56:04 AM UTC+1 John MacFarlane wrote:

> "'Nick Bart' via pandoc-discuss" writes:
>
> > An unofficial fork of text-icu claims to have fixed the issue
> > (https://github.com/WorldSEnder/text-icu/commit/7657227a7ca8ad13be86db5c190b806774a5fd6b).
> >
> > I wonder if anyone could indicate how to tweak the pandoc install
> > command to include, for the time being, the WorldSEnder/text-icu fork
> > rather than the official one - or whether there is anything else I
> > could try to fix this issue on the pandoc side. (I tried downgrading
> > icu4c via homebrew, but apparently no formulae for earlier versions
> > are available.)
>
> Replace stack.yaml with this:
>
> ``` stack.yaml
> flags:
>   pandoc:
>     trypandoc: false
>     embed_data_files: true
>   QuickCheck:
>     old-random: false
>   citeproc:
>     icu: true
> packages:
> - '.'
> extra-deps:
> - hslua-1.3.0
> - hslua-module-path-0.1.0
> - jira-wiki-markup-1.3.4
> - skylighting-core-0.10.5
> - skylighting-0.10.5
> - doclayout-0.3.0.2
> - citeproc-0.3.0.9
> - texmath-0.12.2
> - random-1.2.0
> - git: https://github.com/WorldSEnder/text-icu
>   commit: 7657227a7ca8ad13be86db5c190b806774a5fd6b
> ghc-options:
>   "$locals": -fhide-source-paths -Wno-missing-home-modules
> resolver: lts-17.5
> nix:
>   packages: [zlib]
> ```
>
> Then stack install.
>
> > As an aside, while I fully understand the wish not to have to include
> > a huge external C library by default, I feel that pandoc’s default
> > sorting algorithm, currently based on “i;unicode-casemap” (RFC 5051),
> > is somewhat below par. In particular, it does not even comply with
> > mainstream English-language rules as far as accented characters are
> > concerned. The Chicago Manual of Style (17e, 2017, 16.67)
> > unambiguously states: “Words beginning with or including accented
> > letters are alphabetized as though they were unaccented.” One of
> > their examples gives the sort order “Ubeda – Über – Ubina”. Without
> > icu support, pandoc incorrectly sorts this as “Ubeda – Ubina – Über”.
>
> Yes. I agree. Actually, if we just need special treatment for
> English locales, then I don't think it should be too hard. We
> can use the Haskell unicode-transforms library (already a
> dependency of pandoc) to normalize the text and then remove
> accents:
>
> Prelude Data.Text.Normalize Data.Text Data.Char> Data.Text.filter (not . isMark) $ normalize NFD "dérégler"
> "deregler"
>
> We could sort on the result of that transform.
>
> (This method would also affect non-Western scripts, though, and
> I don't know what the rules around those are...)
>
> For non-English locales, would we want to fall back to RFC 5051?
>
> I'm not sure what all the relevant rules are; if it's not too
> terribly complicated, I wonder if a pure Haskell library could
> be cooked up. It's a shame that there's no way to do proper
> Unicode collation in Haskell without the difficult icu4c
> dependency.
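
For what it's worth, the accent-stripping idea John describes in the quoted message could be turned into a sort key roughly like this. It is only a sketch, not what pandoc or citeproc actually does: `accentInsensitiveKey` and `sortIgnoringAccents` are names I made up, and the lower-casing step is my own assumption (the quoted GHCi snippet only strips marks). It needs the text and unicode-transforms packages.

```haskell
import Data.Char (isMark)
import Data.List (sortOn)
import qualified Data.Text as T
import Data.Text.Normalize (NormalizationMode (NFD), normalize)

-- Hypothetical helper (not part of pandoc): decompose to NFD, drop the
-- combining marks, then lower-case so that case does not affect order.
-- The lower-casing step is an assumption added for this sketch.
accentInsensitiveKey :: T.Text -> T.Text
accentInsensitiveKey = T.toLower . T.filter (not . isMark) . normalize NFD

-- Hypothetical helper: sort by the stripped key, using the original
-- text as a tie-breaker so that the result is deterministic.
sortIgnoringAccents :: [T.Text] -> [T.Text]
sortIgnoringAccents = sortOn (\t -> (accentInsensitiveKey t, t))
```

On the Chicago example, the keys come out as "ubeda", "uber" and "ubina", so `sortIgnoringAccents` applied to Ubeda, Ubina, Über should return Ubeda, Über, Ubina. As John points out, the same filter also strips marks in non-Western scripts, so this is no substitute for real per-locale collation.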