public inbox archive for pandoc-discuss@googlegroups.com
* bug: docx (containing table) to native and docx to markdown then to native is hugely different
@ 2016-12-07  7:06 Kolen Cheung
       [not found] ` <3c212e85-1e24-4fa2-817e-051e55f5821d-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  0 siblings, 1 reply; 3+ messages in thread
From: Kolen Cheung @ 2016-12-07  7:06 UTC (permalink / raw)
  To: pandoc-discuss





I have a document with 2 tables. I converted it to docx (with Word) and am 
now trying to use pandoc to convert it to Markdown.

I got a strange error, and when looking at the native output I found this:

   1. pandoc -t native others/Mirrors-Lens.docx -o others/Mirrors-Lens.native
      looks more or less normal.
   2. pandoc -t markdown others/Mirrors-Lens.docx | pandoc -f markdown -t native -o others/Mirrors-Lens-round.native
      is hugely different from the above.
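
The comparison above can be scripted. This is only a minimal sketch in Python: the helper names (`native_direct_cmd`, `roundtrip_cmds`, `diff_native`, `compare`) are made up for illustration, pandoc is assumed to be on the PATH, and `-f docx` is spelled out even though the commands above let pandoc infer it from the file extension.

```python
import difflib
import subprocess


def native_direct_cmd(docx_path):
    """argv for the direct conversion: docx -> native."""
    return ["pandoc", "-f", "docx", "-t", "native", docx_path]


def roundtrip_cmds(docx_path):
    """argv pair for the round trip: docx -> markdown, then markdown -> native."""
    return (
        ["pandoc", "-f", "docx", "-t", "markdown", docx_path],
        ["pandoc", "-f", "markdown", "-t", "native"],  # reads markdown from stdin
    )


def run(cmd, stdin_text=None):
    """Run a command and return its stdout as text; raises if the command fails."""
    result = subprocess.run(
        cmd, input=stdin_text, capture_output=True, encoding="utf-8", check=True
    )
    return result.stdout


def diff_native(direct, roundtrip):
    """Unified diff between two native dumps, returned as a list of lines."""
    return list(
        difflib.unified_diff(
            direct.splitlines(),
            roundtrip.splitlines(),
            fromfile="direct.native",
            tofile="roundtrip.native",
            lineterm="",
        )
    )


def compare(docx_path):
    """Run both pipelines on docx_path and return the diff lines."""
    direct = run(native_direct_cmd(docx_path))
    to_md, to_native = roundtrip_cmds(docx_path)
    roundtrip = run(to_native, stdin_text=run(to_md))
    return diff_native(direct, roundtrip)
```

Calling `compare("others/Mirrors-Lens.docx")` and printing the returned lines would show where the two native outputs diverge; an empty list would mean they are identical.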

One of the main differences is that both tables (originally multi-column) 
collapsed to a single column in the latter round trip. (Another annoyance 
for me: I tried to use my pantable2csv filter to capture the table from the 
docx directly into CSV, but it resulted in an error.)

For the 2nd table, I’m guessing the presence of < and > causes problems. For 
the 1st I have no clue, since the native looks fine to me.

But since I do not own the copyright of the file, I am not going to share 
it publicly (it isn’t really important, but just in case…). If anyone is 
interested in helping to solve the puzzle / debug pandoc, please give me your 
GitHub account and I can open a private repository and invite you. Thanks!

-- 
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To post to this group, send email to pandoc-discuss-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/3c212e85-1e24-4fa2-817e-051e55f5821d%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


* Re: bug: docx (containing table) to native and docx to markdown then to native is hugely different
       [not found] ` <3c212e85-1e24-4fa2-817e-051e55f5821d-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2016-12-07  9:57   ` Kolen Cheung
  2016-12-07 10:23   ` Kolen Cheung
  1 sibling, 0 replies; 3+ messages in thread
From: Kolen Cheung @ 2016-12-07  9:57 UTC (permalink / raw)
  To: pandoc-discuss



I'm not sure whether I've correctly identified the problem: the docx reader 
might be treating the tables as having only 1 header row, with an empty 
table body. The structure is something like this:

```haskell
[Table [] [AlignDefault,AlignDefault,AlignDefault] [0.0,0.0,0.0]
 [[Para [Str "x",Space,Str "y"]]
 ,[Para [Strong [Emph [Str "a",Space,Str "b"]]]]
 ,[Para [Strong [Emph [Str "Math"]]]]]
 []]
```
Lines 2-4 of the snippet seem to be the header row (three cells), and the 
final `[]` is the table body, which is empty, although I expected it to have 
a length of 3. Panflute asserts that it does, which explains the error I got 
from my filter.

Pandoc reads this just fine, and it is indeed what pandoc's docx reader 
outputs. On the other hand, pandoc's writers, such as markdown and html, 
seem to handle this input incorrectly. Is it a valid pandoc AST?

And as a general rule, is it safe to assert that the align-list, 
width-list, header-list, and each row in the row-list all have the same 
length?
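
To make the question concrete, here is a minimal sketch in Python of the length check I have in mind. It is not panflute's actual code and not a statement of what pandoc guarantees: `table_length_problems` is a hypothetical helper, cell contents are abbreviated to plain strings, and taking the column count from the align-list is my assumption.

```python
def table_length_problems(aligns, widths, headers, rows):
    """Return a list of length mismatches; an empty list means all lengths agree.

    Assumption: the number of columns is defined by the alignment list.
    """
    n_cols = len(aligns)
    problems = []
    if len(widths) != n_cols:
        problems.append(f"widths: {len(widths)} != {n_cols}")
    if len(headers) != n_cols:
        problems.append(f"headers: {len(headers)} != {n_cols}")
    # Every body row is checked on its own; an empty body has no rows to check.
    for i, row in enumerate(rows):
        if len(row) != n_cols:
            problems.append(f"row {i}: {len(row)} != {n_cols}")
    return problems


# The table from this message, with each cell abbreviated to a string.
aligns = ["AlignDefault", "AlignDefault", "AlignDefault"]
widths = [0.0, 0.0, 0.0]
headers = [["x y"], ["a b"], ["Math"]]
rows = []  # the empty table body

print(table_length_problems(aligns, widths, headers, rows))  # prints []
```

Note that the table above passes this check: with no body rows, the per-row condition holds trivially. So "every row has as many cells as there are columns" and "the body is non-empty" are separate conditions, and a filter that needs at least one body row has to test for that explicitly.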


* Re: bug: docx (containing table) to native and docx to markdown then to native is hugely different
       [not found] ` <3c212e85-1e24-4fa2-817e-051e55f5821d-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  2016-12-07  9:57   ` Kolen Cheung
@ 2016-12-07 10:23   ` Kolen Cheung
  1 sibling, 0 replies; 3+ messages in thread
From: Kolen Cheung @ 2016-12-07 10:23 UTC (permalink / raw)
  To: pandoc-discuss





Forget about this: I finally narrowed it down to an MWE and submitted it as 
issue jgm/pandoc#3285 <https://github.com/jgm/pandoc/issues/3285>.

