Is there a way to change the way Pandoc parses HTML inside of markdown documents?

public inbox archive for pandoc-discuss@googlegroups.com

* Is there a way to change the way Pandoc parses HTML inside of markdown documents?
@ 2021-08-16 21:43 pompez
       [not found] ` <aae29ca7-60ca-4349-af03-939f0ac503efn-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  0 siblings, 1 reply; 6+ messages in thread
From: pompez @ 2021-08-16 21:43 UTC (permalink / raw)
  To: pandoc-discuss

I'm just starting out with Lua filters, so I apologize if this question has 
already been answered. The same question is also posted on StackOverflow: 
<https://stackoverflow.com/questions/68809527/is-there-a-way-to-change-the-way-pandoc-parses-html-inside-of-markdown-documents>

I'm using Pandoc to convert markdown to HTML. My markdown files also 
contain some raw HTML. In the examples, I'll be using `<mark>` and `<u>`.

Let's say I want to change every `<mark>` to a `<u>` tag. We parse the 
input as HTML and look at the AST.

```
$ echo '<u>foo</u> & <mark>bar</mark>' | pandoc --from=html --to native
[Plain [Underline [Str "foo"],Space,Str "&",Space,Span ("",["mark"],[]) [Str "bar"]]]
```

On this structure, we can use a simple filter that replaces the `Span` 
elements representing `<mark>` tags with `Underline` elements.

```
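-- Replace spans that carry the "mark" class with Underline elements.
function Span(elem)
    -- classes is a pandoc List; includes() is an exact membership test.
    -- (gmatch always returns an iterator, so it is truthy for every span.)
    if elem.classes:includes('mark') then
        return pandoc.Underline(elem.content)
    end
end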
```

```
[Plain [Underline [Str "foo"],Space,Str "&",Space,Underline [Str "bar"]]]
```

This is good. But if we parse the same input as markdown, we get a much 
less convenient structure.

```
$ echo '<u>foo</u> & <mark>bar</mark>' | pandoc --from=markdown+raw_html --to native
[Para [RawInline (Format "html") "<u>",Str "foo",RawInline (Format "html") "</u>",Space,Str "&",Space,RawInline (Format "html") "<mark>",Str "bar",RawInline (Format "html") "</mark>"]]
```

And if we had additional criteria for deciding whether to replace `<mark>` 
with `<u>` (the content, for example), we would have to identify the 
matching opening and closing `RawInline` elements.

Are there any good solutions to this problem? Is there a way to parse HTML 
in markdown just as HTML would be parsed otherwise? Or is there a way to 
solve this in a Lua filter without writing some parsing code?

-- 
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/aae29ca7-60ca-4349-af03-939f0ac503efn%40googlegroups.com.


* Re: Is there a way to change the way Pandoc parses HTML inside of markdown documents?
       [not found] ` <aae29ca7-60ca-4349-af03-939f0ac503efn-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2021-08-16 22:08   ` John MacFarlane
       [not found]     ` <yh480k1r6tt53d.fsf-pgq/RBwaQ+zq8tPRBa0AtqxOck334EZe@public.gmane.org>
  0 siblings, 1 reply; 6+ messages in thread
From: John MacFarlane @ 2021-08-16 22:08 UTC (permalink / raw)
  To: pompez, pandoc-discuss


I'm afraid you'll have to write some parsing code...
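
A minimal sketch of what such parsing code could look like, assuming the raw
tags appear as siblings in the same inline list (as in the `Para` output from
the markdown reader) and are not nested inside each other. The helper names
`tag_of` and `pair_tags` are made up for illustration and are not part of
pandoc; only `pandoc.Underline` and the `Inlines` filter function are pandoc's
own.

```lua
-- Hypothetical helper: if `el` is a raw HTML inline consisting of a single
-- tag, return its lower-cased name and whether it is a closing tag.
local function tag_of(el)
  if el.t ~= 'RawInline' or el.format ~= 'html' then return nil end
  local slash, name = el.text:match('^<(/?)(%a[%w-]*)[^>]*>$')
  if not name then return nil end
  return name:lower(), slash == '/'
end

-- Hypothetical helper: scan a flat list of inlines and replace every
-- <tagname> ... </tagname> run with whatever `wrap` returns for the
-- inlines found between the two tags.
local function pair_tags(inlines, tagname, wrap)
  local out = {}
  local i = 1
  while i <= #inlines do
    local name, closing = tag_of(inlines[i])
    local close_at = nil
    if name == tagname and not closing then
      -- Look ahead for the first matching closing tag.
      for k = i + 1, #inlines do
        local n, c = tag_of(inlines[k])
        if n == tagname and c then close_at = k; break end
      end
    end
    if close_at then
      local inner = {}
      for k = i + 1, close_at - 1 do inner[#inner + 1] = inlines[k] end
      out[#out + 1] = wrap(inner)
      i = close_at + 1
    else
      -- Not an opening tag of interest, or no closing tag found: keep as is.
      out[#out + 1] = inlines[i]
      i = i + 1
    end
  end
  return out
end

-- Filter entry point: turn <mark>...</mark> into Underline elements.
function Inlines(inlines)
  return pair_tags(inlines, 'mark', function(inner)
    return pandoc.Underline(inner)
  end)
end
```

Because the callback receives the inlines between the two tags, a
content-based criterion can be checked there before deciding what to return.
A self-closing tag such as `<br/>` would be treated as an opening tag by this
pattern, so the sketch is only meant for paired tags.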


* Re: Is there a way to change the way Pandoc parses HTML inside of markdown documents?
       [not found]     ` <yh480k1r6tt53d.fsf-pgq/RBwaQ+zq8tPRBa0AtqxOck334EZe@public.gmane.org>
@ 2021-08-16 22:55       ` pompez
  2021-08-17 10:37       ` William Lupton
  1 sibling, 0 replies; 6+ messages in thread
From: pompez @ 2021-08-16 22:55 UTC (permalink / raw)
  To: pandoc-discuss


That's okay. Just wanted to know beforehand. Thanks.


* Re: Is there a way to change the way Pandoc parses HTML inside of markdown documents?
       [not found]     ` <yh480k1r6tt53d.fsf-pgq/RBwaQ+zq8tPRBa0AtqxOck334EZe@public.gmane.org>
  2021-08-16 22:55       ` pompez
@ 2021-08-17 10:37       ` William Lupton
       [not found]         ` <CAEe_xxj-kp22oToH4o5J54s16W4WzMkiaEicOy+TuqDZf5LP3g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 1 reply; 6+ messages in thread
From: William Lupton @ 2021-08-17 10:37 UTC (permalink / raw)
  To: pandoc-discuss; +Cc: pompez

Could pandoc.read(markup, "html")
<https://pandoc.org/lua-filters.html#pandoc.read> help?
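
One way this could work, as a sketch rather than a tested recipe: serialize
each paragraph that contains raw HTML back to HTML and re-read it with
`pandoc.read`, so that the HTML reader builds the `Underline` and `Span`
elements. This assumes a pandoc version that provides `pandoc.write` (to my
knowledge 2.17 or later, which is newer than this thread); the helper name
`has_raw_html` is hypothetical.

```lua
-- Hypothetical helper: true if any top-level inline is raw HTML.
local function has_raw_html(inlines)
  for _, el in ipairs(inlines) do
    if el.t == 'RawInline' and el.format == 'html' then
      return true
    end
  end
  return false
end

-- Re-parse paragraphs containing raw HTML with the HTML reader.
function Para(para)
  if not has_raw_html(para.content) then
    return nil  -- returning nil leaves the paragraph unchanged
  end
  -- Assumption: pandoc.write is available in the running pandoc version.
  local html = pandoc.write(pandoc.Pandoc({ para }), 'html')
  -- The HTML reader returns a whole document; its blocks replace the Para.
  return pandoc.read(html, 'html').blocks
end
```

A `Span` filter like the one in the original message can then be run as a
second `--lua-filter`, because within a single filter the inline functions
run before the block functions and would not see the re-parsed spans. The
sketch only looks at the top-level inlines of a paragraph, and the HTML round
trip may not preserve everything the markdown reader produced.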


* Re: Is there a way to change the way Pandoc parses HTML inside of markdown documents?
       [not found]         ` <CAEe_xxj-kp22oToH4o5J54s16W4WzMkiaEicOy+TuqDZf5LP3g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2021-08-17 11:24           ` Bastien DUMONT
  2021-08-24  8:44           ` pompez
  1 sibling, 0 replies; 6+ messages in thread
From: Bastien DUMONT @ 2021-08-17 11:24 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

> On this structure, we can use a simple filter that replaces the `Span`
> elements representing `<mark>` tags with `Underline` elements.
>
> ```
> function Span(elem)
>     -- classes is a pandoc List; includes() is an exact membership test.
>     if elem.classes:includes('mark') then
>         return pandoc.Underline(elem.content)
>     end
> end
> ```

To apply the same code to a Markdown input file, you can use inline spans like this:
`[foo]{.underline} & [bar]{.mark}`.


* Re: Is there a way to change the way Pandoc parses HTML inside of markdown documents?
       [not found]         ` <CAEe_xxj-kp22oToH4o5J54s16W4WzMkiaEicOy+TuqDZf5LP3g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2021-08-17 11:24           ` Bastien DUMONT
@ 2021-08-24  8:44           ` pompez
  1 sibling, 0 replies; 6+ messages in thread
From: pompez @ 2021-08-24  8:44 UTC (permalink / raw)
  To: pandoc-discuss


Sorry for the late reply. In my case, I'd still like to recognize the 
contents inside the block.

