public inbox archive for pandoc-discuss@googlegroups.com
* Is there a way to change the way Pandoc parses HTML inside of markdown documents?
@ 2021-08-16 21:43 pompez
From: pompez @ 2021-08-16 21:43 UTC
  To: pandoc-discuss




I'm just starting out with Lua filters, so apologies if this question has 
already been answered. It is also posted on StackOverflow 
<https://stackoverflow.com/questions/68809527/is-there-a-way-to-change-the-way-pandoc-parses-html-inside-of-markdown-documents>.

I'm using Pandoc to convert markdown to HTML. My markdown files also 
contain some raw HTML. In the examples, I'll be using `<mark>` and `<u>`.

Let's say I want to change every `<mark>` to a `<u>` tag. We parse the 
input as HTML and look at the AST.

```
$ echo '<u>foo</u> & <mark>bar</mark>' | pandoc --from=html --to native
[Plain [Underline [Str "foo"],Space,Str "&",Space,Span ("", ["mark"],[]) [Str "bar"]]]
```

On this structure, we can use a simple filter that replaces the `Span` 
elements representing the `<mark>` tag with `Underline` elements.

```
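function Span(elem)
    -- classes is a pandoc List; includes() tests for an exact class match
    -- and is safe when the span has no classes at all
    if elem.classes:includes('mark') then
        return pandoc.Underline(elem.content)
    end
end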
```

Applying the filter to the HTML input gives the desired result:

```
[Plain [Underline [Str "foo"],Space,Str "&",Space,Underline [Str "bar"]]]
```

This is good. But if we parse the same input as markdown, we get a much 
less convenient structure.

```
$ echo '<u>foo</u> & <mark>bar</mark>' | pandoc --from=markdown+raw_html --to native
[Para [RawInline (Format "html") "<u>",Str "foo",RawInline (Format "html") "</u>",Space,Str "&",Space,RawInline (Format "html") "<mark>",Str "bar",RawInline (Format "html") "</mark>"]]
```

If the replacement also depended on some additional criterion (the content 
between the tags, for example), the filter would have to find the matching 
opening and closing `RawInline` elements itself.
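
For illustration, a minimal sketch of that manual pairing might look like the 
following. It assumes the tags are not nested, carry no attributes (so the raw 
text is exactly `<mark>` and `</mark>`), and that both tags sit in the same 
inline list. The helper name `is_raw_html` is made up for this example.

```lua
-- Sketch only: pair raw <mark> ... </mark> inlines and wrap the inlines
-- between them in Underline. Assumes no nesting and no tag attributes.

-- Hypothetical helper: is `el` a raw HTML inline with exactly this text?
local function is_raw_html(el, text)
  return el.t == 'RawInline' and el.format == 'html' and el.text == text
end

function Inlines(inlines)
  local result = {}   -- the rewritten inline list
  local buffer = nil  -- inlines seen since an opening <mark>, else nil
  for _, el in ipairs(inlines) do
    if buffer == nil and is_raw_html(el, '<mark>') then
      buffer = {}                        -- start collecting content
    elseif buffer ~= nil and is_raw_html(el, '</mark>') then
      table.insert(result, pandoc.Underline(buffer))
      buffer = nil                       -- pair closed
    elseif buffer ~= nil then
      table.insert(buffer, el)           -- inside <mark> ... </mark>
    else
      table.insert(result, el)           -- outside any pair: keep as-is
    end
  end
  if buffer ~= nil then
    -- Unmatched opening tag: put it and the collected inlines back.
    table.insert(result, pandoc.RawInline('html', '<mark>'))
    for _, el in ipairs(buffer) do
      table.insert(result, el)
    end
  end
  return result
end
```

An extra condition on the collected content could be checked just before the 
`pandoc.Underline` call, but this is exactly the kind of hand-written parsing 
I would like to avoid.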

Are there any good solutions to this problem? Is there a way to have HTML 
inside markdown parsed the same way as standalone HTML input? Or is there a 
way to solve this in a Lua filter without writing parsing code by hand?


Thread overview: 6+ messages
2021-08-16 21:43 Is there a way to change the way Pandoc parses HTML inside of markdown documents? pompez
2021-08-16 22:08   ` John MacFarlane
2021-08-16 22:55       ` pompez
2021-08-17 10:37       ` William Lupton
2021-08-17 11:24           ` Bastien DUMONT
2021-08-24  8:44           ` pompez