* Removing parts of the document with [walk] in Haskell @ 2021-09-04 19:09 Ilia Zaihcuk [not found] ` <915213fc-e4c4-480b-a6a2-fd3420777ddan-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org> 0 siblings, 1 reply; 5+ messages in thread From: Ilia Zaihcuk @ 2021-09-04 19:09 UTC (permalink / raw) To: pandoc-discuss [-- Attachment #1.1: Type: text/plain, Size: 1930 bytes --] Hi all, Using pandoc-types' Walkable <https://hackage.haskell.org/package/pandoc-types-1.22/docs/Text-Pandoc-Walk.html#t:Walkable> class, what is the best-performing/most idiomatic way to filter out certain elements from a document? Say I wanted to remove all occurrences of the word "the" from a Pandoc. The best implementation I've been able to write for this is isThe :: Inline -> Bool isThe (Str "the") = True isThe _ = False removeThe :: Pandoc -> Pandoc removeThe = walk (filter $ not . isThe) Is this right? Does the use of filter here not mean an additional O(n) traversal happens on every list of Inlines? I feel like a better solution for this would be using filterThe :: Inline -> [Inline] filterThe (Str "the") = [] filterThe x = [x] or something similar, but of course removeThe = walk filterThe does not typecheck. We need a -> a, meaning [Inline] -> [Inline] here. The same question actually applies to any transformation which "changes the number of elements": theFine :: Inline -> [Inline] theFine (Str "the") = [Str "the", Space, Str "fine"] theFine i = [i] allIsFine :: Pandoc -> Pandoc allIsFine = walk $ concatMap theFine Best, Ilia P.S. Sorry if this has been asked before. I feel it must be a common issue, but all I've been able to find is this thread <https://groups.google.com/g/pandoc-discuss/c/idlbnOk1ooE/m/_QVvaRprHVcJ> with its links, which are 7 years old now and seem to be calling for changes in pandoc. Everything else suggests stepping outside Haskell. -- You received this message because you are subscribed to the Google Groups "pandoc-discuss" group. To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/915213fc-e4c4-480b-a6a2-fd3420777ddan%40googlegroups.com. [-- Attachment #1.2: Type: text/html, Size: 3865 bytes --] ^ permalink raw reply [flat|nested] 5+ messages in thread
[parent not found: <915213fc-e4c4-480b-a6a2-fd3420777ddan-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>]
* Re: Removing parts of the document with [walk] in Haskell [not found] ` <915213fc-e4c4-480b-a6a2-fd3420777ddan-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org> @ 2021-09-04 19:26 ` Gwern Branwen 2021-09-05 5:23 ` John MacFarlane 1 sibling, 0 replies; 5+ messages in thread From: Gwern Branwen @ 2021-09-04 19:26 UTC (permalink / raw) To: pandoc-discuss You might find my own struggles with similar issues interesting: https://groups.google.com/g/pandoc-discuss/c/jgb-Q0F2p1Y/m/1DogIEAtAQAJ https://github.com/gwern/gwern.net/blob/master/build/Typography.hs A few things I'll note: your `isThe` only works if you assume that all strings have been fully split into Str/Space/Str sequences. But it's entirely valid to have a Str containing spaces, like `Str " the "`; your rewrite would fail there. So in your example, you could do something like `isThe (Str s) = if "the" `T.isInfixOf` s then Str (T.replace "the" "" s) else (Str s)` and then simply `removeThe = walk isThe`. (I don't know if it meaningfully differs in performance from filtering but if it does, it's probably faster. May still be slower than your wrong version.) > The same question actually applies to any transformation which "changes the number of elements": This is a general problem I've beaten my head against many times. If you do the concat trick, you start running into issues with how the `walk` operates exactly so it'll sometimes repeat operations or omit substitutions you expected to happen, and bottomUp/topDown sometimes cause issues like their own different omissions or just infinite loops (always fun). The best approach I've found is to *not* change the number of elements, using instead `Inline -> Inline` by using 'the Span trick': 'Span' is the *only* Inline which can wrap multiple Inlines without meaningfully affecting them. (If you wrap them in Para or Div, it's a Block; if you wrap them in Italics, they're italicized; if you wrap them in Str, you need to compile down to a flat string deleting all formatting etc.) This is how I do auto-smallcaps, inserting wordbreak characters after slashes, automatically turning matching text into hyperlinks, etc: I walk the tree, check Strs for matches of a regexp or character, either return a Str unaltered (so still an Inline) or return a Span (still an Inline) whose '[Inline]' payload can be anything like '[Str "foo", Span "smallcaps-auto" [Str "SMALLCAPS"], Str "bar"]' or '[Link nullAttr ["GPU"] ("https://en.wikipedia.org/wiki/Graphics_processing_unit","")]'... The walk typechecks, the substitution happens consistently as you expect without infinite loops or missing or redundant substitutions, and it isn't too hard to understand. The drawback is that Span wrappers aren't great in terms of efficiency, but correctness is more important than speed, and you can always postprocess the HTML to remove the <span> wrappers not doing anything. (You can substitute in more powerful rewrites by going to raw HTML but it gets very difficult when you work with `RawInline`, because now you have to be very careful about the opacity of the Text blob inside RawInline and the ordering and accidentally clobbering your HTML literals or messing them up. Try to stick to Span rewrites if at all possible.) -- gwern https://www.gwern.net ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: Removing parts of the document with [walk] in Haskell [not found] ` <915213fc-e4c4-480b-a6a2-fd3420777ddan-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org> 2021-09-04 19:26 ` Gwern Branwen @ 2021-09-05 5:23 ` John MacFarlane [not found] ` <m2mtorlhoo.fsf-jF64zX8BO0+FqBokazbCQ6OPv3vYUT2dxr7GGTnW70NeoWH0uzbU5w@public.gmane.org> 1 sibling, 1 reply; 5+ messages in thread From: John MacFarlane @ 2021-09-05 5:23 UTC (permalink / raw) To: Ilia Zaihcuk, pandoc-discuss I have often used the method > allIsFine = walk $ concatMap theFine But I don't think this will be any more efficient than your first version with `walk (filter $ not . isThe)`. Actually, I'd be interested in seeing benchmarks with these approaches, if efficiency matters enough to you to research this more. Another possibility is to walk with a function like killThe :: Inline -> Inline killThe (Str "the") = Str "" -- or: Span ("",[],[]) [] killThe x = x This should be more efficient than the list versions, though again it would be interesting to see benchmarks. Ilia Zaihcuk <zoickx-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> writes: > Hi all, > > Using pandoc-types' Walkable > <https://hackage.haskell.org/package/pandoc-types-1.22/docs/Text-Pandoc-Walk.html#t:Walkable> > class, what is the best-performing/most idiomatic way to filter out certain > elements from a document? > > Say I wanted to remove all occurrences of the word "the" from a Pandoc. The > best implementation I've been able to write for this is > > isThe :: Inline -> Bool > isThe (Str "the") = True > isThe _ = False > > removeThe :: Pandoc -> Pandoc > removeThe = walk (filter $ not . isThe) > > Is this right? Does the use of filter here not mean an additional O(n) > traversal happens on every list of Inlines? > > I feel like a better solution for this would be using > > filterThe :: Inline -> [Inline] > filterThe (Str "the") = [] > filterThe x = [x] > > or something similar, but of course > > removeThe = walk filterThe > > does not typecheck. We need a -> a, meaning [Inline] -> [Inline] here. > > > The same question actually applies to any transformation which "changes the > number of elements": > > theFine :: Inline -> [Inline] > theFine (Str "the") = [Str "the", Space, Str "fine"] > theFine i = [i] > > allIsFine :: Pandoc -> Pandoc > allIsFine = walk $ concatMap theFine > > Best, > > Ilia > > P.S. Sorry if this has been asked before. I feel it must be a common issue, > but all I've been able to find is this thread > <https://groups.google.com/g/pandoc-discuss/c/idlbnOk1ooE/m/_QVvaRprHVcJ> > with its links, which are 7 years old now and seem to be calling for > changes in pandoc. Everything else suggests stepping outside Haskell. > > -- > You received this message because you are subscribed to the Google Groups "pandoc-discuss" group. > To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org > To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/915213fc-e4c4-480b-a6a2-fd3420777ddan%40googlegroups.com. ^ permalink raw reply [flat|nested] 5+ messages in thread
[parent not found: <m2mtorlhoo.fsf-jF64zX8BO0+FqBokazbCQ6OPv3vYUT2dxr7GGTnW70NeoWH0uzbU5w@public.gmane.org>]
* Re: Removing parts of the document with [walk] in Haskell [not found] ` <m2mtorlhoo.fsf-jF64zX8BO0+FqBokazbCQ6OPv3vYUT2dxr7GGTnW70NeoWH0uzbU5w@public.gmane.org> @ 2021-09-05 15:09 ` Ilia Zaihcuk [not found] ` <1473b62b-4b33-49b5-ac6d-52a5571d8068n-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org> 0 siblings, 1 reply; 5+ messages in thread From: Ilia Zaihcuk @ 2021-09-05 15:09 UTC (permalink / raw) To: pandoc-discuss [-- Attachment #1.1: Type: text/plain, Size: 4838 bytes --] Thanks for the detailed replies! The trick with using Span is neat, although it does seem more like a workaround than a true solution - it *is* making additional changes to the AST after all. Can I be sure that they won't influence my output in unexpected ways? (e.g. some obscure issue like auto-hyphenation in LaTeX, page breaks, etc.) @jgm, I don't really care about performance that much in my application, but I did do some benchmarking by your suggestion: here's a separate repo <https://git.sr.ht/~zoickx/pandoc-benchmark>. This is actually my first ever encounter with the GHC profiler, so it's rather crude. Do let me know if you'd like any more info. TLDR: it does look like using the Span trick is more performant than concatMap (%time 0.8 vs 0.9), while the number of invocations of the base function (theFine in my original example) remains perfectly unchanged. @gwern *> If you do the concat trick, you start running into issues with how the `walk` operates exactly so it'll sometimes repeat operations or omit substitutions you expected to happen, and bottomUp/topDown sometimes cause issues like their own different omissions or just infinite loops (always fun).* This is scary! Correctness is certainly more important than speed. Do you have a concrete example where this issue occurs? If @jgm (the original author of pandoc!) is using this method and there are cases in which it behaves unexpectedly, this sounds worthy of an issue and/or a pull request on GitHub. On Sunday, September 5, 2021 at 8:23:19 AM UTC+3 John MacFarlane wrote: > > I have often used the method > > > allIsFine = walk $ concatMap theFine > > But I don't think this will be any more efficient than your > first version with `walk (filter $ not . isThe)`. > Actually, I'd be interested in seeing benchmarks with these > approaches, if efficiency matters enough to you to research > this more. > > Another possibility is to walk with a function like > > killThe :: Inline -> Inline > killThe (Str "the") = Str "" -- or: Span ("",[],[]) [] > killThe x = x > > This should be more efficient than the list versions, though > again it would be interesting to see benchmarks. > > > > Ilia Zaihcuk <zoi...-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> writes: > > > Hi all, > > > > Using pandoc-types' Walkable > > < > https://hackage.haskell.org/package/pandoc-types-1.22/docs/Text-Pandoc-Walk.html#t:Walkable> > > > class, what is the best-performing/most idiomatic way to filter out > certain > > elements from a document? > > > > Say I wanted to remove all occurrences of the word "the" from a Pandoc. > The > > best implementation I've been able to write for this is > > > > isThe :: Inline -> Bool > > isThe (Str "the") = True > > isThe _ = False > > > > removeThe :: Pandoc -> Pandoc > > removeThe = walk (filter $ not . isThe) > > > > Is this right? Does the use of filter here not mean an additional O(n) > > traversal happens on every list of Inlines? > > > > I feel like a better solution for this would be using > > > > filterThe :: Inline -> [Inline] > > filterThe (Str "the") = [] > > filterThe x = [x] > > > > or something similar, but of course > > > > removeThe = walk filterThe > > > > does not typecheck. We need a -> a, meaning [Inline] -> [Inline] here. > > > > > > The same question actually applies to any transformation which "changes > the > > number of elements": > > > > theFine :: Inline -> [Inline] > > theFine (Str "the") = [Str "the", Space, Str "fine"] > > theFine i = [i] > > > > allIsFine :: Pandoc -> Pandoc > > allIsFine = walk $ concatMap theFine > > > > Best, > > > > Ilia > > > > P.S. Sorry if this has been asked before. I feel it must be a common > issue, > > but all I've been able to find is this thread > > <https://groups.google.com/g/pandoc-discuss/c/idlbnOk1ooE/m/_QVvaRprHVcJ> > > > with its links, which are 7 years old now and seem to be calling for > > changes in pandoc. Everything else suggests stepping outside Haskell. > > > > -- > > You received this message because you are subscribed to the Google > Groups "pandoc-discuss" group. > > To unsubscribe from this group and stop receiving emails from it, send > an email to pandoc-discus...-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org > > To view this discussion on the web visit > https://groups.google.com/d/msgid/pandoc-discuss/915213fc-e4c4-480b-a6a2-fd3420777ddan%40googlegroups.com > . > -- You received this message because you are subscribed to the Google Groups "pandoc-discuss" group. To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/1473b62b-4b33-49b5-ac6d-52a5571d8068n%40googlegroups.com. [-- Attachment #1.2: Type: text/html, Size: 7416 bytes --] ^ permalink raw reply [flat|nested] 5+ messages in thread
[parent not found: <1473b62b-4b33-49b5-ac6d-52a5571d8068n-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>]
* Re: Removing parts of the document with [walk] in Haskell [not found] ` <1473b62b-4b33-49b5-ac6d-52a5571d8068n-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org> @ 2021-09-05 15:42 ` Gwern Branwen 0 siblings, 0 replies; 5+ messages in thread From: Gwern Branwen @ 2021-09-05 15:42 UTC (permalink / raw) To: pandoc-discuss On Sun, Sep 5, 2021 at 11:09 AM Ilia Zaihcuk <zoickx-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote: > The trick with using Span is neat, although it does seem more like a workaround than a true solution - it is making additional changes to the AST after all. Can I be sure that they won't influence my output in unexpected ways? (e.g. some obscure issue like auto-hyphenation in LaTeX, page breaks, etc.) I only use HTML output, so I have no idea what other targets like LaTeX do, or if they even output spans at all. Using the span trick necessarily introduces spans around the rewritten text, but I think this is generally desirable. (The whole point of the smallcaps pass is to add marked-up spans for the CSS to apply a smallcaps variant to, and the spans allow me to suppress redundant definitions of terms when doing link rewrites.) The spans might change the necessary CSS for something or other, but I haven't run into any edge-cases there. If the span wrappers worry you, you can do an additional pass to change them, but you are going to run into the basic 'expression problem' https://en.wikipedia.org/wiki/Expression_problem regardless: a strong static FP language like Haskell makes it extremely easy and downright trivial to write a function like 'Inline -> Inline' to plug into 'Pandoc -> Pandoc', but the cost of this is that the datatype 'Inline' or 'Pandoc' cannot be changed easily. (But at least doing a cleanup pass scales well in terms of effort: define all the rewrites you need with the Span trick easily, and then write a single painful cleanup pass.) > This is scary! Correctness is certainly more important than speed. Do you have a concrete example where this issue occurs? If @jgm (the original author of pandoc!) is using this method and there are cases in which it behaves unexpectedly, this sounds worthy of an issue and/or a pull request on GitHub. I think the recursion patterns tend to work worst when you are adding in new nodes so that it adds a new node, then descends into it, applies the same rewrite adding new nodes, descends into them... Stuff like that. Not necessarily wrong, but subtle and requiring a lot of understanding to know which one to use. -- gwern https://www.gwern.net -- You received this message because you are subscribed to the Google Groups "pandoc-discuss" group. To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/CAMwO0gwipJYRLBdt%3DMOcn_%3Dc_xWu5q-9Y54BZ%2Bvr0SH7anAHaw%40mail.gmail.com. ^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2021-09-05 15:42 UTC | newest] Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2021-09-04 19:09 Removing parts of the document with [walk] in Haskell Ilia Zaihcuk [not found] ` <915213fc-e4c4-480b-a6a2-fd3420777ddan-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org> 2021-09-04 19:26 ` Gwern Branwen 2021-09-05 5:23 ` John MacFarlane [not found] ` <m2mtorlhoo.fsf-jF64zX8BO0+FqBokazbCQ6OPv3vYUT2dxr7GGTnW70NeoWH0uzbU5w@public.gmane.org> 2021-09-05 15:09 ` Ilia Zaihcuk [not found] ` <1473b62b-4b33-49b5-ac6d-52a5571d8068n-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org> 2021-09-05 15:42 ` Gwern Branwen
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).