I'm not an expert in this, but I believe a pure Haskell solution means
implementing the Unicode Collation Algorithm (UCA). The Unicode Common Locale Data
Repository (CLDR) contains the per-locale settings that
configure the algorithm to sort according to each locale's rules. This is
what ICU does.
On Monday, March 22, 2021 at 6:56:04 AM UTC+1 John MacFarlane wrote:
> "'Nick Bart' via pandoc-discuss"
> writes:
>
> > An unofficial fork of text-icu claims to have fixed the issue (
> https://github.com/WorldSEnder/text-icu/commit/7657227a7ca8ad13be86db5c190b806774a5fd6b
> ).
> >
> > I wonder if anyone could indicate how to tweak the pandoc install
> command to include, for the time being, the WorldSEnder/text-icu fork
> rather than the official one - or whether there is anything else I could
> try to fix this issue on the pandoc side. (I tried downgrading icu4c via
> homebrew, but apparently no formulae for earlier versions are available.)
>
> Replace stack.yaml with this:
>
>
> ``` stack.yaml
> flags:
>   pandoc:
>     trypandoc: false
>     embed_data_files: true
>   QuickCheck:
>     old-random: false
>   citeproc:
>     icu: true
> packages:
> - '.'
> extra-deps:
> - hslua-1.3.0
> - hslua-module-path-0.1.0
> - jira-wiki-markup-1.3.4
> - skylighting-core-0.10.5
> - skylighting-0.10.5
> - doclayout-0.3.0.2
> - citeproc-0.3.0.9
> - texmath-0.12.2
> - random-1.2.0
> - git: https://github.com/WorldSEnder/text-icu
>   commit: 7657227a7ca8ad13be86db5c190b806774a5fd6b
> ghc-options:
>   "$locals": -fhide-source-paths -Wno-missing-home-modules
> resolver: lts-17.5
> nix:
>   packages: [zlib]
> ```
>
> Then stack install.
>
> > As an aside, while I fully understand the wish not to have to include a
> huge external C library by default, I feel that pandoc’s default sorting
> algorithm, currently based on “i;unicode-casemap” (RFC 5051), is somewhat
> below par. In particular, it does not even comply with mainstream
> English-language rules as far as accented characters are concerned. The
> Chicago Manual of Style (17e, 2017, 16.67) unambiguously states: “Words
> beginning with or including accented letters are alphabetized as though
> they were unaccented.” One of their examples gives the sort order “Ubeda –
> Über – Ubina”. Without icu support, pandoc incorrectly sorts this as “Ubeda
> – Ubina – Über”.
>
> Yes. I agree. Actually, if we just need special treatment for
> English locales, then I don't think it should be too hard. We
> can use the Haskell unicode-transforms library (already a
> dependency of pandoc) to normalize the text and then remove
> accents:
>
> Prelude Data.Text.Normalize Data.Text Data.Char> Data.Text.filter (not . isMark) $ normalize NFD "dérégler"
> "deregler"
>
> We could sort on the result of that transform.
>
> (This method would also affect non-Western scripts, though, and
> I don't know what the rules around those are...)
>
> For non-English locales, would we want to fall back to RFC 5051?
>
> I'm not sure what all the relevant rules are; if it's not too
> terribly complicated, I wonder if a pure Haskell library could
> be cooked up. It's a shame that there's no way to do proper
> Unicode collation in Haskell without the difficult icu4c
> dependency.
>
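
For what it's worth, the accent-stripping idea quoted above could be sketched roughly as follows. This is only an illustration, not how pandoc or citeproc actually sort: the names sortKey and sortUnaccented are made up, the case folding via toLower is my own assumption, and it relies on the text and unicode-transforms packages being available. It is not a substitute for the UCA, and it ignores locale-specific rules entirely.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import Data.Char (isMark)
import Data.List (sortOn)
import Data.Text (Text)
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import Data.Text.Normalize (NormalizationMode (NFD), normalize)

-- | Hypothetical sort key: decompose to NFD so that accents become
-- separate combining marks, drop those marks, then lower-case so that
-- case does not affect the order (the lower-casing is an assumption,
-- not something proposed in the thread).
sortKey :: Text -> Text
sortKey = T.toLower . T.filter (not . isMark) . normalize NFD

-- | Sort by the accent-free key. sortOn is stable, so entries whose
-- keys are equal (e.g. "Uber" and "Über") keep their input order.
sortUnaccented :: [Text] -> [Text]
sortUnaccented = sortOn sortKey

main :: IO ()
main = mapM_ TIO.putStrLn (sortUnaccented ["Ubina", "Über", "Ubeda"])
-- prints Ubeda, Über, Ubina (one per line)
```

With the Chicago example this gives "Ubeda – Über – Ubina". As John notes, the same filter also strips marks in non-Western scripts, so this would at best be suitable for English-style sorting.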