public inbox archive for pandoc-discuss@googlegroups.com
* Why does GFM to markdown not convert HTML?
From: Dominik Wujastyk @ 2021-10-07 13:31 UTC
  To: pandoc-discuss


Using  
pandoc -v
pandoc 2.14.2
Compiled with pandoc-types 1.22, texmath 0.12.3.1, skylighting 0.11,
citeproc 0.5, ipynb 0.1.0.1

GFM input example:

# NAK 1-1079

<h2>
  Chapter-wise concordance of folios
</h2> 
<h3> 
  Prepared by Dominik Wujastyk (DW) and Andrey Klebanov (AK) 
</h3>

Note that this MS (a single physical object kept at the __NAK__ under the 
accession number __1-1079__) 
was microfilmed twice, as **A 45-5 (on 16.10.1970)** and **A 1267-11 (on 
16.11.1987)**. Digital copies 
of both microfilms are available to us.

Command:  

pandoc -f gfm -t commonmark -o outfile.md infile.gfm

Commonmark output:

# NAK 1-1079

<h2>
  Chapter-wise concordance of folios
</h2> 
<h3> 
  Prepared by Dominik Wujastyk (DW) and Andrey Klebanov (AK) 
</h3>

Note that this MS (a single physical object kept at the **NAK** under
the accession number **1-1079**) was microfilmed twice, as **A 45-5 (on
16.10.1970)** and **A 1267-11 (on 16.11.1987)**. Digital copies of both
microfilms are available to us.


I was expecting this command to turn the HTML markup in the gfm file 
into commonmark Markdown.  But it didn't.  Am I doing something silly?  
Have I failed to understand what commonmark is?  The HTML-coded text does 
render in GitHub and in editors like Typora.  So it seems wrong to treat it 
as raw blocks.

Furthermore, a markdown-encoded table in the gfm document is converted to 
an HTML-encoded one.  Why?  This seems counterintuitive to me. 

-- 
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/eca62f3a-d4e3-4459-830c-ca4a3de2d125n%40googlegroups.com.


* Re: Why does GFM to markdown not convert HTML?
From: BPJ @ 2021-10-08 11:34 UTC
  To: pandoc-discuss

In a sense you are in luck: the two HTML headings are parsed as a single
raw HTML block, which is what pandoc normally does with embedded
block-level HTML content in (any kind of?) Markdown. So you could have a
Lua filter parse that block from HTML into native elements and replace it
with those native elements, like this:

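``````lua
-- Re-parse raw HTML blocks so that pandoc sees native elements.
function RawBlock (raw)
  if raw.format == 'html' then
    -- pandoc.read returns a Pandoc document; its blocks replace the
    -- raw block.
    local doc = pandoc.read(raw.text, 'html')
    if doc then return doc.blocks end
  end
  -- Returning nil leaves any other raw block unchanged.
  return nil
end
``````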

https://pandoc.org/lua-filters.html

https://pandoc.org/lua-filters.html#pandoc.read
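
To try the filter, one option is to save it to a file and pass that file to pandoc with `--lua-filter`. The sketch below is only an illustration: the file name `html-to-native.lua` is made up, `infile.gfm` and `outfile.md` are the names from the command in the question, and the conversion is skipped when pandoc or the input file is not available.

```shell
# Save the filter from this message under a hypothetical file name.
cat > html-to-native.lua <<'EOF'
function RawBlock (raw)
  if 'html' == raw.format then
    local html = raw.text
    local doc = pandoc.read(html, 'html')
    if doc then return doc.blocks end
  end
  return nil
end
EOF

# Same conversion as in the question, with the filter applied.
# infile.gfm and outfile.md are the file names from the original command.
if command -v pandoc >/dev/null 2>&1 && [ -f infile.gfm ]; then
  pandoc -f gfm -t commonmark --lua-filter=html-to-native.lua \
    -o outfile.md infile.gfm
else
  echo "pandoc or infile.gfm not found; skipping conversion" >&2
fi
```

The filter runs after the gfm reader has built the document and before the commonmark writer sees it, so the writer receives whatever native elements the HTML reader produced.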

This does not guarantee that you will get back no raw HTML, since some
HTML may not be representable as native elements. You will most probably
get back native elements, which may or may not contain some raw elements.
In this case the success rate will be 100%.
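
A rough way to check whether any raw HTML survived is to search the output for tag-like text. This is only a heuristic sketch, not something pandoc provides: the helper name `leftover_html` is made up, and the pattern will also flag tags that appear inside code blocks, so the hits need reading by eye.

```shell
# Print (with line numbers) the lines of a file that contain something
# shaped like an HTML tag, e.g. <div>, </table> or <img src="x" />.
# Autolinks such as <https://pandoc.org> are not matched, because the
# tag name must be followed by whitespace, "/" or ">".
leftover_html() {
  grep -nE '</?[A-Za-z][A-Za-z0-9]*([[:space:]][^>]*)?/?>' "$1"
}

# outfile.md is the output name from the command in the question.
if [ -f outfile.md ]; then
  leftover_html outfile.md || echo "no tag-like text found in outfile.md"
fi
```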

HTH,

/bpj

