public inbox archive for pandoc-discuss@googlegroups.com
* Pandoc breaks table headers when converting HTML (exported from Confluence) to Github Flavored Markdown
@ 2023-07-14  6:03 'Michael Mell' via pandoc-discuss
From: 'Michael Mell' via pandoc-discuss @ 2023-07-14  6:03 UTC (permalink / raw)
  To: pandoc-discuss



I am trying to convert HTML pages from our Confluence Wiki to GitHub 
Flavored Markdown for the GitHub Wiki.

I want to remove all formatting to get a "vanilla" Markdown output without 
embedded HTML. I settled on this command for the moment:

```sh
pandoc failing_table_tidy_reduced.html -f html-native_divs-native_spans -t gfm-raw_html -o failing_table_tidy_reduced.md
```

**(The contents of `failing_table_tidy_reduced.html` are pasted below.)**

The Markdown output is OK for the most part, except that the table headers 
are systematically broken. I get this for the example file that is pasted 
below:

```md
|                                                |                                               |                                                                                                                                                         |
|------------------------------------------------|-----------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------|
| Step 1: Select to open image as virtual stack. | Step 2: Select image folder and open dataset. | Step 3: View with opened image stack. Use the slider of in the phase contrast histogram (top) to adjust image saturation for better channel visibility. |
| ![](attachments/314948158/314950704.png)       | ![](attachments/314948158/314950710.png)      | ![](attachments/314948158/314950785.png)                                                                                                                |
```

Whereas I expect the text (i.e. "Step N: ...") to be in the table header, 
like so:

```md
| Step 1: Select to open image as virtual stack. | Step 2: Select image folder and open dataset. | Step 3: View with opened image stack. Use the slider of in the phase contrast histogram (top) to adjust image saturation for better channel visibility. |
|------------------------------------------------|-----------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------|
| ![](attachments/314948158/314950704.png)       | ![](attachments/314948158/314950710.png)      | ![](attachments/314948158/314950785.png)                                                                                                                |
```

What am I doing wrong?

---
This is the content of `failing_table_tidy_reduced.html`:

```html
<!DOCTYPE html>
<html>
<head>
<meta name="generator" content=
"HTML Tidy for HTML5 for Linux version 5.6.0">
<title>Title</title>
<link rel="stylesheet" href="styles/site.css" type="text/css">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<style type='text/css'>
/*<![CDATA[*/
div.rbtoc1689000519714 {padding: 0px;}
div.rbtoc1689000519714 ul {margin-left: 0px;}
div.rbtoc1689000519714 li {margin-left: 0px;padding-left: 0px;}

/*]]>*/
</style>
</head>
<body class="theme-default aui-theme-default">
<div class="table-wrap">
<table class="wrapped relative-table confluenceTable" style=
"width: 48.0112%;">
<colgroup>
<col style="width: 27.3364%;">
<col style="width: 28.271%;">
<col style="width: 44.3925%;"></colgroup>
<tbody>
<tr>
<th class="confluenceTh">
<p>Step 1: Select to open image as virtual stack.</p>
</th>
<th class="confluenceTh">
<p>Step 2: Select image folder and open dataset.</p>
</th>
<th class="confluenceTh">Step 3: View with opened image stack. Use
the slider of in the phase contrast histogram (top) to adjust image
saturation for better channel visibility.</th>
</tr>
<tr>
<td colspan="1" class="confluenceTd">
<div class="content-wrapper">
<p><span class=
"confluence-embedded-file-wrapper confluence-embedded-manual-size"><img 
class="confluence-embedded-image confluence-thumbnail"
draggable="false" height="250" src=
"attachments/314948158/314950704.png" data-image-src=
"attachments/314948158/314950704.png"
data-unresolved-comment-count="0" data-linked-resource-id=
"314950704" data-linked-resource-version="1"
data-linked-resource-type="attachment"
data-linked-resource-default-alias="image2022-4-26_15-0-46.png"
data-base-url="https://my.url.com"
data-linked-resource-content-type="image/png"
data-linked-resource-container-id="314948158"
data-linked-resource-container-version="61" alt=""></span></p>
</div>
</td>
<td colspan="1" class="confluenceTd">
<div class="content-wrapper">
<p><span class=
"confluence-embedded-file-wrapper confluence-embedded-manual-size"><img 
class="confluence-embedded-image confluence-thumbnail"
draggable="false" height="250" src=
"attachments/314948158/314950710.png" data-image-src=
"attachments/314948158/314950710.png"
data-unresolved-comment-count="0" data-linked-resource-id=
"314950710" data-linked-resource-version="1"
data-linked-resource-type="attachment"
data-linked-resource-default-alias="image2022-4-26_15-1-20.png"
data-base-url="https://my.url.com"
data-linked-resource-content-type="image/png"
data-linked-resource-container-id="314948158"
data-linked-resource-container-version="61" alt=""></span></p>
</div>
</td>
<td colspan="1" class="confluenceTd">
<div class="content-wrapper">
<p><span class=
"confluence-embedded-file-wrapper confluence-embedded-manual-size"><img 
class="confluence-embedded-image"
draggable="false" height="250" src=
"attachments/314948158/314950785.png" data-image-src=
"attachments/314948158/314950785.png"
data-unresolved-comment-count="0" data-linked-resource-id=
"314950785" data-linked-resource-version="1"
data-linked-resource-type="attachment"
data-linked-resource-default-alias="image2022-4-26_15-12-47.png"
data-base-url="https://my.url.com"
data-linked-resource-content-type="image/png"
data-linked-resource-container-id="314948158"
data-linked-resource-container-version="61" alt=""></span></p>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</body>
</html>
```

-- 
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/e4b6b290-ab59-4ff6-83ac-47b017e033f5n%40googlegroups.com.


* Re: Pandoc breaks table headers when converting HTML (exported from Confluence) to Github Flavored Markdown
From: John MacFarlane @ 2023-07-14 18:39 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw


I'm guessing the issue is that the heading for your table is inside the tbody element, rather than thead.
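
If that guess is right, one possible workaround is to preprocess the Confluence export so that a leading row made up only of `<th>` cells sits in a `<thead>` before pandoc reads the file. The following is a minimal sketch in Python, not a tested fix for this export: the names `promote_header_row` and `FIRST_ROW` are hypothetical, and the regex assumes simple, non-nested tables with no existing `<thead>`, like the one in the sample file.

```python
import re

# Assumption: tables are not nested and have no <thead> yet, so the first
# </tr> after <tbody> closes the first row of that table body.
FIRST_ROW = re.compile(r"<tbody>\s*(<tr\b[^>]*>.*?</tr>)",
                       re.DOTALL | re.IGNORECASE)


def promote_header_row(html: str) -> str:
    """Move a leading all-<th> row out of <tbody> into a new <thead>."""

    def fix(match: re.Match) -> str:
        row = match.group(1)
        has_td = re.search(r"<td\b", row, re.IGNORECASE) is not None
        has_th = re.search(r"<th\b", row, re.IGNORECASE) is not None
        # Only rewrite when the first row consists purely of header cells.
        if has_td or not has_th:
            return match.group(0)
        return f"<thead>\n{row}\n</thead>\n<tbody>"

    return FIRST_ROW.sub(fix, html)


if __name__ == "__main__":
    sample = ("<table><tbody><tr><th>Step 1</th></tr>"
              "<tr><td>image</td></tr></tbody></table>")
    print(promote_header_row(sample))
```

The rewritten HTML could be saved to a new file and converted with the same pandoc command as before, which would also show whether the `<tbody>` placement is really the cause. For exports with nested tables, a real HTML parser would be a safer choice than a regex.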


