public inbox archive for pandoc-discuss@googlegroups.com
From: Gwern Branwen <gwern-v26ZT+9V8bxeoWH0uzbU5w@public.gmane.org>
To: pandoc-discuss <pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
Subject: Re: Removing parts of the document with [walk] in Haskell
Date: Sat, 4 Sep 2021 15:26:51 -0400
Message-ID: <CAMwO0gzkw0Xmsb0R1orjc8z3z32WBi60Mj+kYeGpKfE7epMBtQ@mail.gmail.com>
In-Reply-To: <915213fc-e4c4-480b-a6a2-fd3420777ddan-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>

You might find my own struggles with similar issues interesting:
https://groups.google.com/g/pandoc-discuss/c/jgb-Q0F2p1Y/m/1DogIEAtAQAJ
https://github.com/gwern/gwern.net/blob/master/build/Typography.hs

A few things I'll note: your `isThe` only works if you assume that all
strings have been fully split into Str/Space/Str sequences. But it's
entirely valid to have a Str containing spaces, like `Str " the "`;
your rewrite would fail there. So in your example, you could do
something like `isThe (Str s) = if T.isInfixOf "the" s then Str
(T.replace "the" "" s) else Str s`, plus a catch-all `isThe x = x` for
every other Inline, and then simply `removeThe = walk isThe`. (I don't
know if it meaningfully differs in performance from filtering, but if
it does, it's probably faster. It may still be slower than your wrong
version.)
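
A minimal sketch of that suggestion, assuming a pandoc-types version
whose Str holds Text rather than String. The `if` is dropped because
`T.replace` already returns its input unchanged when there is no match;
otherwise the names are the ones used above.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.Text as T
import Text.Pandoc.Definition
import Text.Pandoc.Walk (walk)

-- Delete every occurrence of "the" inside a Str, whether or not the
-- Str also contains spaces. Every other Inline passes through as-is.
-- This is a plain substring match, so "other" becomes "or"; matching
-- on word boundaries is left out of this sketch.
isThe :: Inline -> Inline
isThe (Str s) = Str (T.replace "the" "" s)
isThe x       = x

-- Inline -> Inline keeps the number of elements fixed, so a plain
-- `walk` is enough.
removeThe :: Pandoc -> Pandoc
removeThe = walk isThe
```

Note that a Str which consisted only of "the" is left behind as an
empty `Str ""` rather than removed, which is exactly what keeps the
element count unchanged.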

> The same question actually applies to any transformation which "changes the number of elements":

This is a general problem I've beaten my head against many times. If
you do the concat trick, you start running into issues with exactly
how `walk` traverses the tree, so it'll sometimes repeat operations or
omit substitutions you expected to happen, and bottomUp/topDown
sometimes cause issues of their own, like different omissions or just
infinite loops (always fun).

The best approach I've found is to *not* change the number of
elements, and instead stick to `Inline -> Inline` by using 'the Span
trick': 'Span' is the *only* Inline which can wrap multiple Inlines
without meaningfully affecting them. (If you wrap them in Para or Div,
it's a Block; if you wrap them in Emph, they're italicized; if you
wrap them in Str, you need to compile down to a flat string, deleting
all formatting, etc.)

This is how I do auto-smallcaps, inserting wordbreak characters after
slashes, automatically turning matching text into hyperlinks, etc: I
walk the tree, check Strs for matches of a regexp or character, and
either return the Str unaltered (so still an Inline) or return a Span
(still an Inline) whose '[Inline]' payload can be anything, like
'[Str "foo", Span ("", ["smallcaps-auto"], []) [Str "SMALLCAPS"],
Str "bar"]' or '[Link nullAttr [Str "GPU"]
("https://en.wikipedia.org/wiki/Graphics_processing_unit","")]'... The
walk typechecks, the substitution happens consistently as you expect
without infinite loops or missing or redundant substitutions, and it
isn't too hard to understand.
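
A minimal sketch of the Span trick for the auto-link case. This is
not the code from Typography.hs: it uses a literal substring match
instead of a regexp, and the names `autoLink` and `gpuLink` are
hypothetical. It assumes a pandoc-types version whose Str holds Text.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.List (intersperse)
import qualified Data.Text as T
import Text.Pandoc.Definition
import Text.Pandoc.Walk (walk)

-- The replacement Inline for each match (hypothetical helper).
gpuLink :: Inline
gpuLink =
  Link nullAttr [Str "GPU"]
       ("https://en.wikipedia.org/wiki/Graphics_processing_unit", "")

-- Inline -> Inline: one Inline in, one Inline out.
autoLink :: Inline -> Inline
autoLink (Str s) =
  case T.splitOn "GPU" s of
    -- No match: return the Str unaltered.
    [_]    -> Str s
    -- Match: put a link between the surrounding pieces, drop empty
    -- pieces, and wrap the whole list in a single attribute-less Span.
    pieces -> Span nullAttr
                (filter (/= Str "") (intersperse gpuLink (map Str pieces)))
autoLink x = x

autoLinkDoc :: Pandoc -> Pandoc
autoLinkDoc = walk autoLink
```

`walk` rewrites the children of a node before the node itself and does
not descend again into what the function returns, so the `Str "GPU"`
inside the new Link is not matched a second time. A Str that already
sits inside an existing Link would still be rewritten; this sketch
does not guard against that.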

The drawback is that Span wrappers aren't great in terms of
efficiency, but correctness is more important than speed, and you can
always postprocess the HTML to remove the <span> wrappers not doing
anything.

(You can substitute in more powerful rewrites by going to raw HTML but
it gets very difficult when you work with `RawInline`, because now you
have to be very careful about the opacity of the Text blob inside
RawInline and the ordering and accidentally clobbering your HTML
literals or messing them up. Try to stick to Span rewrites if at all
possible.)

-- 
gwern
https://www.gwern.net

