From: jcr
Newsgroups: gmane.text.pandoc
Subject: Re: Error compiling with icu support / possible workaround?
Date: Mon, 22 Mar 2021 13:29:56 -0700 (PDT)
Message-ID: <5035db2e-16b9-4923-8e38-d95b81d27840n@googlegroups.com>
To: pandoc-discuss

I'm not an expert in this, but I believe a pure Haskell solution means implementing the Unicode Collation Algorithm. The Unicode Common Locale Data Repository contains the per-locale settings to configure the algorithm to sort according to the locale's rules. This is what ICU does.

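To illustrate the level-by-level comparison that the Unicode Collation Algorithm is built around, here is a toy sketch. It is not UCA: real collation weights come from the default collation element table plus the CLDR tailorings, whereas `primary`, `secondary` and `compareLevels` below are made-up names that only mimic the idea with unicode-transforms.

```haskell
import Data.Char (isMark, toLower)
import Data.List (sortBy)
import Data.Ord (comparing)
import qualified Data.Text as T
import Data.Text.Normalize (NormalizationMode (NFD), normalize)

-- Toy stand-in for a primary weight: base letters only, lower-cased.
-- (Hypothetical helper; real UCA weights come from a lookup table.)
primary :: T.Text -> T.Text
primary = T.map toLower . T.filter (not . isMark) . normalize NFD

-- Toy stand-in for a secondary weight: the combining marks, in order.
secondary :: T.Text -> T.Text
secondary = T.filter isMark . normalize NFD

-- Compare level by level: a later level only matters if all earlier
-- levels tie. The final `compare` is just a deterministic tie-breaker.
compareLevels :: T.Text -> T.Text -> Ordering
compareLevels a b =
  comparing primary a b <> comparing secondary a b <> compare a b

main :: IO ()
main = do
  let ws = map T.pack ["résumé", "retire", "resume"]
  -- Accents only break ties, so "résumé" lands next to "resume".
  print (sortBy compareLevels ws == map T.pack ["resume", "résumé", "retire"])
```

A plain code-point sort would put "résumé" after "retire"; here the accent is only consulted once the base letters compare equal. Locale-specific rules (the CLDR part) are not modelled at all.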
On Monday, March 22, 2021 at 6:56:04 AM UTC+1 John MacFarlane wrote:

"'Nick Bart' via pandoc-discuss" <pandoc-...@googlegroups.com> writes:

> An unofficial fork of text-icu claims to have fixed the issue (https://github.com/WorldSEnder/text-icu/commit/7657227a7ca8ad13be86db5c190b806774a5fd6b).
>
> I wonder if anyone could indicate how to tweak the pandoc install command to include, for the time being, the WorldSEnder/text-icu fork rather than the official one - or whether there is anything else I could try to fix this issue on the pandoc side. (I tried downgrading icu4c via homebrew, but apparently no formulae for earlier versions are available.)

Replace stack.yaml with this:


``` stack.yaml
flags:
  pandoc:
    trypandoc: false
    embed_data_files: true
  QuickCheck:
    old-random: false
  citeproc:
    icu: true
packages:
- '.'
extra-deps:
- hslua-1.3.0
- hslua-module-path-0.1.0
- jira-wiki-markup-1.3.4
- skylighting-core-0.10.5
- skylighting-0.10.5
- doclayout-0.3.0.2
- citeproc-0.3.0.9
- texmath-0.12.2
- random-1.2.0
- git: https://github.com/WorldSEnder/text-icu
  commit: 7657227a7ca8ad13be86db5c190b806774a5fd6b
ghc-options:
  "$locals": -fhide-source-paths -Wno-missing-home-modules
resolver: lts-17.5
nix:
  packages: [zlib]
```

Then stack install.

> As an aside, while I fully understand the wish not to have to include a huge external C library by default, I feel that pandoc’s default sorting algorithm, currently based on “i;unicode-casemap” (RFC 5051), is somewhat below par. In particular, it does not even comply with mainstream English-language rules as far as accented characters are concerned. The Chicago Manual of Style (17e, 2017, 16.67) unambiguously states: “Words beginning with or including accented letters are alphabetized as though they were unaccented.” One of their examples gives the sort order “Ubeda – Über – Ubina”. Without icu support, pandoc incorrectly sorts this as “Ubeda – Ubina – Über”.

Yes. I agree. Actually, if we just need special treatment for
English locales, then I don't think it should be too hard. We
can use the Haskell unicode-transforms library (already a
dependency of pandoc) to normalize the text and then remove
accents:

Prelude Data.Text.Normalize Data.Text Data.Char> Data.Text.filter (not . isMark) $ normalize NFD "dérégler"
"deregler"

We could sort on the result of that transform.
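
As a minimal sketch of that idea, assuming the Chicago example from the quoted message as test data: `stripAccents` and `sortIgnoringAccents` are hypothetical names, not anything in pandoc or citeproc, and lower-casing the key is an extra assumption so that case does not affect the order.

```haskell
import Data.Char (isMark)
import Data.List (sortOn)
import qualified Data.Text as T
import Data.Text.Normalize (NormalizationMode (NFD), normalize)

-- Decompose to NFD, then drop the combining marks (hypothetical helper).
stripAccents :: T.Text -> T.Text
stripAccents = T.filter (not . isMark) . normalize NFD

-- Sort on the accent-stripped, lower-cased form; the original strings
-- are returned untouched. sortOn is stable, so entries whose keys are
-- equal keep their input order.
sortIgnoringAccents :: [T.Text] -> [T.Text]
sortIgnoringAccents = sortOn (T.toLower . stripAccents)

main :: IO ()
main = do
  let input    = map T.pack ["Ubina", "Über", "Ubeda"]
      expected = map T.pack ["Ubeda", "Über", "Ubina"]
  -- The order given in the Chicago Manual of Style example.
  print (sortIgnoringAccents input == expected)
```

Because the key is computed per string, this only needs unicode-transforms and no C library; it does nothing locale-specific.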

(This method would also affect non-Western scripts, though, and
I don't know what the rules around those are...)

For non-English locales, would we want to fall back to RFC 5051?

I'm not sure what all the relevant rules are; if it's not too
terribly complicated, I wonder if a pure Haskell library could
be cooked up. It's a shame that there's no way to do proper
unicode collation in Haskell without the difficult icu4c
dependency.

--
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org.
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/5035db2e-16b9-4923-8e38-d95b81d27840n%40googlegroups.com.