public inbox archive for pandoc-discuss@googlegroups.com
* md-docx-md roundtripping not working
@ 2018-12-23 21:55 Denis Maier
       [not found] ` <de0093bf-6f3a-45cc-be5b-02764dd126bd-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  0 siblings, 1 reply; 4+ messages in thread
From: Denis Maier @ 2018-12-23 21:55 UTC (permalink / raw)
  To: pandoc-discuss


[-- Attachment #1.1: Type: text/plain, Size: 2178 bytes --]

I tested whether I can convert a docx produced with pandoc back to markdown 
after making a few changes in the docx. Normal paragraphs, blockquotes, 
headings and footnotes work fine. However, author, title, and date end up 
as normal paragraphs in the resulting markdown file, whereas the original 
source had a YAML metadata block.

I have this file (test.md) and use `pandoc test.md -o test.docx`

```
---
author: Author
title: Test
date: Dezember 2018
---

Heading
=======

Test Test Test
```

I can then convert the resulting docx to pandoc's native format:

```
Pandoc (Meta {unMeta = fromList [("author",MetaInlines [Str 
"Author"]),("date",MetaInlines [Str "Dezember",Space,Str 
"2018"]),("title",MetaInlines [Str "Test"])]})
[Header 1 ("heading",[],[]) [Str "Heading"]
,Div ("",[],[("custom-style","FirstParagraph")])
 [Para [Str "Test",Space,Str "Test",Space,Str "Test"]]]
```

Now, after making one small edit to the docx, converting it to pandoc's native 
format (`pandoc test.docx -f docx+styles -t native`) gives me:

```
Pandoc (Meta {unMeta = fromList []})
[Div ("",[],[("custom-style","Titel")])
 [Para [Str "Test"]]
,Div ("",[],[("custom-style","Author")])
 [Para [Str "Author"]]
,Div ("",[],[("custom-style","Datum")])
 [Para [Str "Dezember",Space,Str "2018"]]
,Header 1 ("heading",[],[]) [Str "Heading"]
,Div ("",[],[("custom-style","FirstParagraph")])
 [Para [Str "Test",Space,Str "Test",Space,Str "Test.",Space,Str 
"Another",Space,Str "Test."]]]
```

What is going wrong here? As you can see, my change was trivial and did not 
touch the metadata. Nevertheless, we end up with different styles.

-- 
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To post to this group, send email to pandoc-discuss-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/de0093bf-6f3a-45cc-be5b-02764dd126bd%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

[-- Attachment #1.2: Type: text/html, Size: 3315 bytes --]

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: md-docx-md roundtripping not working
       [not found] ` <de0093bf-6f3a-45cc-be5b-02764dd126bd-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2018-12-24 11:32   ` BP Jonsson
  2018-12-25 15:38     ` Denis Maier
  0 siblings, 1 reply; 4+ messages in thread
From: BP Jonsson @ 2018-12-24 11:32 UTC (permalink / raw)
  To: pandoc-discuss

[-- Attachment #1: Type: text/plain, Size: 3767 bytes --]

I think this is to be expected, since Pandoc inserts the title,
author and date as *text* elements at the top of the docx document, which
usually is what you want. It's the same with HTML. In an HTML template you
can arrange for the title heading, author, etc. to be marked with a class
so that you can use a filter to rearrange them as metadata when converting
back to markdown. I don't know if the docx writer applies any special
styles to these elements, but if it does you could use the `+styles`
extension and catch them with a filter.
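
A rough sketch of such a filter, written as a Python JSON filter using only
the standard library. It assumes the reader was run with `-f docx+styles`, so
that styled paragraphs arrive as Divs with a `custom-style` attribute, and
that the style names are the ones that appear in this thread (Title/Titel,
Author, Date/Datum). The function names and the mapping table are made up for
illustration; this is not something pandoc ships.

```python
import json
import sys

# Style names seen in this thread: pandoc's own (Title, Author, Date) and the
# German ones that showed up after editing in Word (Titel, Datum).
# This table is an assumption; extend it for other locales.
STYLE_TO_META = {
    "Title": "title", "Titel": "title",
    "Author": "author",
    "Date": "date", "Datum": "date",
}


def custom_style(block):
    """Return the custom-style attribute of a Div block, or None."""
    if block.get("t") != "Div":
        return None
    (_ident, _classes, keyvals), _content = block["c"]
    return dict(keyvals).get("custom-style")


def lift_metadata(doc):
    """Move styled title/author/date Divs from the body into doc['meta']."""
    meta = doc.setdefault("meta", {})
    kept = []
    for block in doc.get("blocks", []):
        key = STYLE_TO_META.get(custom_style(block))
        inner = block["c"][1] if key else []
        # Only lift simple Divs that wrap exactly one paragraph, and never
        # overwrite a metadata field that is already set.
        if (key and key not in meta and len(inner) == 1
                and inner[0]["t"] in ("Para", "Plain")):
            meta[key] = {"t": "MetaInlines", "c": inner[0]["c"]}
        else:
            kept.append(block)
    doc["blocks"] = kept
    return doc


def main():
    """Read a pandoc JSON AST on stdin, write the transformed AST to stdout."""
    json.dump(lift_metadata(json.load(sys.stdin)), sys.stdout)
```

To try it as a filter, add a call to `main()` at the bottom, save it under
some name (say `lift_meta.py`, a hypothetical file name), make it executable
and run something like
`pandoc test.docx -f docx+styles -t markdown -s --filter ./lift_meta.py`.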

/bpj


[-- Attachment #2: Type: text/html, Size: 5431 bytes --]

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: md-docx-md roundtripping not working
  2018-12-24 11:32   ` BP Jonsson
@ 2018-12-25 15:38     ` Denis Maier
       [not found]       ` <5157c829-b27c-4e8c-83b3-44e227c0a637-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  0 siblings, 1 reply; 4+ messages in thread
From: Denis Maier @ 2018-12-25 15:38 UTC (permalink / raw)
  To: pandoc-discuss


[-- Attachment #1.1: Type: text/plain, Size: 4317 bytes --]

Well, the point is that I can convert an unmodified pandoc-produced docx 
back to markdown, and the metadata will end up in a YAML metadata block. 
The problem comes up only after I modify the docx in Word. As you can see, 
Word changes the names of the styles, perhaps because my default system 
language is German. I guess the problem is related to this.
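
One way to check that guess is to look at the style definitions in the docx
before and after the Word edit. A docx is a zip archive, and the styles live
in `word/styles.xml`. A minimal sketch using only the Python standard library
(the function name is made up for illustration):

```python
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML namespace used by word/styles.xml.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def paragraph_styles(docx):
    """Map style ID -> display name for the paragraph styles in a docx.

    `docx` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(docx) as zf:
        root = ET.fromstring(zf.read("word/styles.xml"))
    styles = {}
    for style in root.iter(W + "style"):
        if style.get(W + "type") != "paragraph":
            continue  # skip character, table and numbering styles
        name = style.find(W + "name")
        styles[style.get(W + "styleId")] = (
            name.get(W + "val") if name is not None else None
        )
    return styles
```

Comparing `paragraph_styles("test.docx")` for the file written by pandoc and
for the file saved by Word should show whether the style IDs or names were
changed on save.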

Denis

[-- Attachment #1.2: Type: text/html, Size: 6913 bytes --]

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: md-docx-md roundtripping not working
       [not found]       ` <5157c829-b27c-4e8c-83b3-44e227c0a637-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2018-12-25 23:59         ` BP Jonsson
  0 siblings, 0 replies; 4+ messages in thread
From: BP Jonsson @ 2018-12-25 23:59 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw, Denis Maier

On 2018-12-25 at 16:38, Denis Maier wrote:
> Well, the point is that I can convert an unmodified pandoc produced docx
> back to markdown, and the metadata will end up in an yaml metadata block.
> The problem comes up only after I modify the docx in Word. As you can see
> Word changes the names of the styles, perhaps because my default system
> language is German, I guess the problem is related to this.

I see.  I did some exploration and experimentation along these lines.

For better or worse I don't have an OS that can run Word available 
at the moment, but I created a docx file with author/title/date 
metadata fields and edited it in LibreOffice, including changing 
the document language to Swedish. Nothing similar happened, but I 
noted that the title/author/date paragraphs inserted by Pandoc 
have the named paragraph styles Title, Author and Date 
respectively. The last two are listed as custom styles and so are 
probably defined by Pandoc. When I change the paragraph style of 
any of those paragraphs, the metadata fields disappear when I 
convert with `pandoc -so output.md input.docx`. (Note the 
`-s` (aka `--standalone`) option: without it no metadata is 
included at all!)

When I changed the paragraph styles back, the metadata fields 
reappeared. In fact Pandoc seems to honor any paragraphs using one 
of these paragraph styles but ignore the document properties, so 
quite possibly all you need to do is check that those named 
paragraph styles are properly applied in your modified docx file.
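
If you'd rather check this without opening the file in Word, the style each
paragraph references is stored in `word/document.xml` inside the docx (a zip
archive), in the `w:pStyle` element of the paragraph properties. A small
sketch using only the Python standard library; the function name and the
`limit` parameter are made up for illustration:

```python
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML namespace used by word/document.xml.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def applied_styles(docx, limit=5):
    """Return (style ID, text) for the first `limit` paragraphs of a docx.

    The style ID is None for paragraphs with no explicit paragraph style.
    `docx` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(docx) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    rows = []
    for para in list(root.iter(W + "p"))[:limit]:
        pstyle = para.find(W + "pPr/" + W + "pStyle")
        style_id = pstyle.get(W + "val") if pstyle is not None else None
        # Concatenate the text runs so the paragraph is easy to recognise.
        text = "".join(t.text or "" for t in para.iter(W + "t"))
        rows.append((style_id, text))
    return rows
```

Running `applied_styles("test.docx")` on the modified file should show which
style IDs the title, author and date paragraphs point to after the edit.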

/bpj


^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2018-12-25 23:59 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-12-23 21:55 md-docx-md roundtripping not working Denis Maier
     [not found] ` <de0093bf-6f3a-45cc-be5b-02764dd126bd-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
2018-12-24 11:32   ` BP Jonsson
2018-12-25 15:38     ` Denis Maier
     [not found]       ` <5157c829-b27c-4e8c-83b3-44e227c0a637-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
2018-12-25 23:59         ` BP Jonsson
