Removing parts of the document with [walk] in Haskell

public inbox archive for pandoc-discuss@googlegroups.com
 help / color / mirror / Atom feed

* Removing parts of the document with [walk] in Haskell
@ 2021-09-04 19:09 Ilia Zaihcuk
       [not found] ` <915213fc-e4c4-480b-a6a2-fd3420777ddan-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  0 siblings, 1 reply; 5+ messages in thread
From: Ilia Zaihcuk @ 2021-09-04 19:09 UTC (permalink / raw)
  To: pandoc-discuss

[-- Attachment #1.1: Type: text/plain, Size: 1930 bytes --]

Hi all,

Using pandoc-types' Walkable 
<https://hackage.haskell.org/package/pandoc-types-1.22/docs/Text-Pandoc-Walk.html#t:Walkable> 
class, what is the best-performing/most idiomatic way to filter out certain 
elements from a document?

Say I wanted to remove all occurrences of the word "the" from a Pandoc. The 
best implementation I've been able to write for this is

isThe :: Inline -> Bool 
isThe (Str "the") = True
isThe _ = False

removeThe :: Pandoc -> Pandoc
removeThe = walk (filter $ not . isThe)

Is this right? Does the use of filter here not mean an additional O(n) 
traversal happens on every list of Inlines?

I feel like a better solution for this would be using

filterThe :: Inline -> [Inline] 
filterThe (Str "the") = []
filterThe x = [x]

or something similar, but of course

removeThe = walk filterThe

does not typecheck. We need a -> a, meaning [Inline] -> [Inline] here.

The same question actually applies to any transformation which "changes the 
number of elements":

theFine :: Inline -> [Inline] 
theFine (Str "the") = [Str "the", Space, Str "fine"]
theFine i = [i]

allIsFine :: Pandoc -> Pandoc
allIsFine = walk $ concatMap theFine

Best,

Ilia

P.S. Sorry if this has been asked before. I feel it must be a common issue, 
but all I've been able to find is this thread 
<https://groups.google.com/g/pandoc-discuss/c/idlbnOk1ooE/m/_QVvaRprHVcJ> 
with its links, which are 7 years old now and seem to be calling for 
changes in pandoc. Everything else suggests stepping outside Haskell.

-- 
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/915213fc-e4c4-480b-a6a2-fd3420777ddan%40googlegroups.com.

[-- Attachment #1.2: Type: text/html, Size: 3865 bytes --]

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Removing parts of the document with [walk] in Haskell
       [not found] ` <915213fc-e4c4-480b-a6a2-fd3420777ddan-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2021-09-04 19:26   ` Gwern Branwen
  2021-09-05  5:23   ` John MacFarlane
  1 sibling, 0 replies; 5+ messages in thread
From: Gwern Branwen @ 2021-09-04 19:26 UTC (permalink / raw)
  To: pandoc-discuss

You might find my own struggles with similar issues interesting:
https://groups.google.com/g/pandoc-discuss/c/jgb-Q0F2p1Y/m/1DogIEAtAQAJ
https://github.com/gwern/gwern.net/blob/master/build/Typography.hs

A few things I'll note: your `isThe` only works if you assume that all
strings have been fully split into Str/Space/Str sequences. But it's
entirely valid to have a Str containing spaces, like `Str " the "`;
your rewrite would fail there. So in your example, you could do
something like `isThe (Str s) = if "the" `T.isInfixOf` s then Str
(T.replace "the" "" s) else (Str s)` and then simply `removeThe = walk
isThe`. (I don't know if it meaningfully differs in performance from
filtering but if it does, it's probably faster. May still be slower
than your wrong version.)

> The same question actually applies to any transformation which "changes the number of elements":

This is a general problem I've beaten my head against many times. If
you do the concat trick, you start running into issues with how the
`walk` operates exactly so it'll sometimes repeat operations or omit
substitutions you expected to happen, and bottomUp/topDown sometimes
cause issues like their own different omissions or just infinite loops
(always fun).

The best approach I've found is to *not* change the number of
elements, using instead `Inline -> Inline` by using 'the Span trick':
'Span' is the *only* Inline which can wrap multiple Inlines without
meaningfully affecting them. (If you wrap them in Para or Div, it's a
Block; if you wrap them in Italics, they're italicized; if you wrap
them in Str, you need to compile down to a flat string deleting all
formatting etc.)

This is how I do auto-smallcaps, inserting wordbreak characters after
slashes, automatically turning matching text into hyperlinks, etc: I
walk the tree, check Strs for matches of a regexp or character, either
return a Str unaltered (so still an Inline) or return a Span (still an
Inline) whose '[Inline]' payload can be anything like '[Str "foo",
Span "smallcaps-auto" [Str "SMALLCAPS"], Str "bar"]' or '[Link
nullAttr ["GPU"]
("https://en.wikipedia.org/wiki/Graphics_processing_unit","")]'... The
walk typechecks, the substitution happens consistently as you expect
without infinite loops or missing or redundant substitutions, and it
isn't too hard to understand.

The drawback is that Span wrappers aren't great in terms of
efficiency, but correctness is more important than speed, and you can
always postprocess the HTML to remove the <span> wrappers not doing
anything.

(You can substitute in more powerful rewrites by going to raw HTML but
it gets very difficult when you work with `RawInline`, because now you
have to be very careful about the opacity of the Text blob inside
RawInline and the ordering and accidentally clobbering your HTML
literals or messing them up. Try to stick to Span rewrites if at all
possible.)

-- 
gwern
https://www.gwern.net

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Removing parts of the document with [walk] in Haskell
       [not found] ` <915213fc-e4c4-480b-a6a2-fd3420777ddan-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  2021-09-04 19:26   ` Gwern Branwen
@ 2021-09-05  5:23   ` John MacFarlane
       [not found]     ` <m2mtorlhoo.fsf-jF64zX8BO0+FqBokazbCQ6OPv3vYUT2dxr7GGTnW70NeoWH0uzbU5w@public.gmane.org>
  1 sibling, 1 reply; 5+ messages in thread
From: John MacFarlane @ 2021-09-05  5:23 UTC (permalink / raw)
  To: Ilia Zaihcuk, pandoc-discuss


I have often used the method

> allIsFine = walk $ concatMap theFine

But I don't think this will be any more efficient than your
first version with `walk (filter $ not . isThe)`.
Actually, I'd be interested in seeing benchmarks with these
approaches, if efficiency matters enough to you to research
this more.

Another possibility is to walk with a function like

killThe :: Inline -> Inline
killThe (Str "the") = Str ""  -- or: Span ("",[],[]) []
killThe x = x

This should be more efficient than the list versions, though
again it would be interesting to see benchmarks.



Ilia Zaihcuk <zoickx-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> writes:

> Hi all,
>
> Using pandoc-types' Walkable 
> <https://hackage.haskell.org/package/pandoc-types-1.22/docs/Text-Pandoc-Walk.html#t:Walkable> 
> class, what is the best-performing/most idiomatic way to filter out certain 
> elements from a document?
>
> Say I wanted to remove all occurrences of the word "the" from a Pandoc. The 
> best implementation I've been able to write for this is
>
> isThe :: Inline -> Bool 
> isThe (Str "the") = True
> isThe _ = False
>
> removeThe :: Pandoc -> Pandoc
> removeThe = walk (filter $ not . isThe)
>
> Is this right? Does the use of filter here not mean an additional O(n) 
> traversal happens on every list of Inlines?
>
> I feel like a better solution for this would be using
>
> filterThe :: Inline -> [Inline] 
> filterThe (Str "the") = []
> filterThe x = [x]
>
> or something similar, but of course
>
> removeThe = walk filterThe
>
> does not typecheck. We need a -> a, meaning [Inline] -> [Inline] here.
>
>
> The same question actually applies to any transformation which "changes the 
> number of elements":
>
> theFine :: Inline -> [Inline] 
> theFine (Str "the") = [Str "the", Space, Str "fine"]
> theFine i = [i]
>
> allIsFine :: Pandoc -> Pandoc
> allIsFine = walk $ concatMap theFine
>
> Best,
>
> Ilia
>
> P.S. Sorry if this has been asked before. I feel it must be a common issue, 
> but all I've been able to find is this thread 
> <https://groups.google.com/g/pandoc-discuss/c/idlbnOk1ooE/m/_QVvaRprHVcJ> 
> with its links, which are 7 years old now and seem to be calling for 
> changes in pandoc. Everything else suggests stepping outside Haskell.
>
> -- 
> You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
> To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/915213fc-e4c4-480b-a6a2-fd3420777ddan%40googlegroups.com.


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Removing parts of the document with [walk] in Haskell
       [not found]     ` <m2mtorlhoo.fsf-jF64zX8BO0+FqBokazbCQ6OPv3vYUT2dxr7GGTnW70NeoWH0uzbU5w@public.gmane.org>
@ 2021-09-05 15:09       ` Ilia Zaihcuk
       [not found]         ` <1473b62b-4b33-49b5-ac6d-52a5571d8068n-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  0 siblings, 1 reply; 5+ messages in thread
From: Ilia Zaihcuk @ 2021-09-05 15:09 UTC (permalink / raw)
  To: pandoc-discuss


[-- Attachment #1.1: Type: text/plain, Size: 4838 bytes --]

Thanks for the detailed replies!

The trick with using Span is neat, although it does seem more like a 
workaround than a true solution - it *is* making additional changes to the 
AST after all. Can I be sure that they won't influence my output in 
unexpected ways? (e.g. some obscure issue like auto-hyphenation in LaTeX, 
page breaks, etc.)

@jgm, I don't really care about performance that much in my application, 
but I did do some benchmarking by your suggestion: here's a separate repo 
<https://git.sr.ht/~zoickx/pandoc-benchmark>. This is actually my first 
ever encounter with the GHC profiler, so it's rather crude. Do let me know 
if you'd like any more info. TLDR: it does look like using the Span trick 
is more performant than concatMap (%time 0.8 vs 0.9), while the number of 
invocations of the base function (theFine in my original example) remains 
perfectly unchanged.

@gwern
*> If you do the concat trick, you start running into issues with how the 
`walk` operates exactly so it'll sometimes repeat operations or omit 
substitutions you expected to happen, and bottomUp/topDown sometimes cause 
issues like their own different omissions or just infinite loops (always 
fun).*
This is scary! Correctness is certainly more important than speed. Do you 
have a concrete example where this issue occurs? If @jgm (the original 
author of pandoc!) is using this method and there are cases in which it 
behaves unexpectedly, this sounds worthy of an issue and/or a pull request 
on GitHub.
On Sunday, September 5, 2021 at 8:23:19 AM UTC+3 John MacFarlane wrote:

>
> I have often used the method
>
> > allIsFine = walk $ concatMap theFine
>
> But I don't think this will be any more efficient than your
> first version with `walk (filter $ not . isThe)`.
> Actually, I'd be interested in seeing benchmarks with these
> approaches, if efficiency matters enough to you to research
> this more.
>
> Another possibility is to walk with a function like
>
> killThe :: Inline -> Inline
> killThe (Str "the") = Str "" -- or: Span ("",[],[]) []
> killThe x = x
>
> This should be more efficient than the list versions, though
> again it would be interesting to see benchmarks.
>
>
>
> Ilia Zaihcuk <zoi...-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> writes:
>
> > Hi all,
> >
> > Using pandoc-types' Walkable 
> > <
> https://hackage.haskell.org/package/pandoc-types-1.22/docs/Text-Pandoc-Walk.html#t:Walkable> 
>
> > class, what is the best-performing/most idiomatic way to filter out 
> certain 
> > elements from a document?
> >
> > Say I wanted to remove all occurrences of the word "the" from a Pandoc. 
> The 
> > best implementation I've been able to write for this is
> >
> > isThe :: Inline -> Bool 
> > isThe (Str "the") = True
> > isThe _ = False
> >
> > removeThe :: Pandoc -> Pandoc
> > removeThe = walk (filter $ not . isThe)
> >
> > Is this right? Does the use of filter here not mean an additional O(n) 
> > traversal happens on every list of Inlines?
> >
> > I feel like a better solution for this would be using
> >
> > filterThe :: Inline -> [Inline] 
> > filterThe (Str "the") = []
> > filterThe x = [x]
> >
> > or something similar, but of course
> >
> > removeThe = walk filterThe
> >
> > does not typecheck. We need a -> a, meaning [Inline] -> [Inline] here.
> >
> >
> > The same question actually applies to any transformation which "changes 
> the 
> > number of elements":
> >
> > theFine :: Inline -> [Inline] 
> > theFine (Str "the") = [Str "the", Space, Str "fine"]
> > theFine i = [i]
> >
> > allIsFine :: Pandoc -> Pandoc
> > allIsFine = walk $ concatMap theFine
> >
> > Best,
> >
> > Ilia
> >
> > P.S. Sorry if this has been asked before. I feel it must be a common 
> issue, 
> > but all I've been able to find is this thread 
> > <https://groups.google.com/g/pandoc-discuss/c/idlbnOk1ooE/m/_QVvaRprHVcJ> 
>
> > with its links, which are 7 years old now and seem to be calling for 
> > changes in pandoc. Everything else suggests stepping outside Haskell.
> >
> > -- 
> > You received this message because you are subscribed to the Google 
> Groups "pandoc-discuss" group.
> > To unsubscribe from this group and stop receiving emails from it, send 
> an email to pandoc-discus...-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
> > To view this discussion on the web visit 
> https://groups.google.com/d/msgid/pandoc-discuss/915213fc-e4c4-480b-a6a2-fd3420777ddan%40googlegroups.com
> .
>

-- 
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/1473b62b-4b33-49b5-ac6d-52a5571d8068n%40googlegroups.com.

[-- Attachment #1.2: Type: text/html, Size: 7416 bytes --]

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Removing parts of the document with [walk] in Haskell
       [not found]         ` <1473b62b-4b33-49b5-ac6d-52a5571d8068n-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2021-09-05 15:42           ` Gwern Branwen
  0 siblings, 0 replies; 5+ messages in thread
From: Gwern Branwen @ 2021-09-05 15:42 UTC (permalink / raw)
  To: pandoc-discuss

On Sun, Sep 5, 2021 at 11:09 AM Ilia Zaihcuk <zoickx-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> The trick with using Span is neat, although it does seem more like a workaround than a true solution - it is making additional changes to the AST after all. Can I be sure that they won't influence my output in unexpected ways? (e.g. some obscure issue like auto-hyphenation in LaTeX, page breaks, etc.)

I only use HTML output, so I have no idea what other targets like
LaTeX do, or if they even output spans at all. Using the span trick
necessarily introduces spans around the rewritten text, but I think
this is generally desirable. (The whole point of the smallcaps pass is
to add marked-up spans for the CSS to apply a smallcaps variant to,
and the spans allow me to suppress redundant definitions of terms when
doing link rewrites.) The spans might change the necessary CSS for
something or other, but I haven't run into any edge-cases there.

If the span wrappers worry you, you can do an additional pass to
change them, but you are going to run into the basic 'expression
problem' https://en.wikipedia.org/wiki/Expression_problem regardless:
a strong static FP language like Haskell makes it extremely easy and
downright trivial to write a function like 'Inline -> Inline' to plug
into 'Pandoc -> Pandoc', but the cost of this is that the datatype
'Inline' or 'Pandoc' cannot be changed easily.

(But at least doing a cleanup pass scales well in terms of effort:
define all the rewrites you need with the Span trick easily, and then
write a single painful cleanup pass.)

> This is scary! Correctness is certainly more important than speed. Do you have a concrete example where this issue occurs? If @jgm (the original author of pandoc!) is using this method and there are cases in which it behaves unexpectedly, this sounds worthy of an issue and/or a pull request on GitHub.

I think the recursion patterns tend to work worst when you are adding
in new nodes so that it adds a new node, then descends into it,
applies the same rewrite adding new nodes, descends into them... Stuff
like that. Not necessarily wrong, but subtle and requiring a lot of
understanding to know which one to use.

-- 
gwern
https://www.gwern.net

-- 
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/CAMwO0gwipJYRLBdt%3DMOcn_%3Dc_xWu5q-9Y54BZ%2Bvr0SH7anAHaw%40mail.gmail.com.

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2021-09-05 15:42 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-04 19:09 Removing parts of the document with [walk] in Haskell Ilia Zaihcuk
     [not found] ` <915213fc-e4c4-480b-a6a2-fd3420777ddan-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
2021-09-04 19:26   ` Gwern Branwen
2021-09-05  5:23   ` John MacFarlane
     [not found]     ` <m2mtorlhoo.fsf-jF64zX8BO0+FqBokazbCQ6OPv3vYUT2dxr7GGTnW70NeoWH0uzbU5w@public.gmane.org>
2021-09-05 15:09       ` Ilia Zaihcuk
     [not found]         ` <1473b62b-4b33-49b5-ac6d-52a5571d8068n-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
2021-09-05 15:42           ` Gwern Branwen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).