* Custom styles in docx to markdown conversion.
From: Joost Kremers @ 2021-12-10 12:07 UTC
To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

Hi list,

I'm trying to convert some MS Word documents to markdown, which works surprisingly well, except for two issues I'm wondering about:

1. Section headings aren't converted to `## ...`. This, I assume, is due to the fact that they have a custom style.

2. Most of the text is rendered as block quotes. This seems to be caused by the fact that the body text has a custom style as well, since there are a few paragraphs that are not rendered as block quotes and they don't have any style applied to them in the docx file.

So, is there a way to make Pandoc understand these styles and convert them correctly?

TIA

Joost

--
Joost Kremers
Life has its moments

--
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/a1fcb30f-4d0d-449b-b02e-b375f8e38abe%40www.fastmail.com.
* Re: Custom styles in docx to markdown conversion.
From: John MacFarlane @ 2021-12-10 16:56 UTC
To: Joost Kremers, pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

"Joost Kremers" <joostkremers-97jfqw80gc6171pxa8y+qA@public.gmane.org> writes:

> Hi list,
>
> I'm trying to convert some MS Word documents to markdown, which works surprisingly well, except for two issues I'm wondering about:
>
> 1. Section headings aren't converted to `## ...`. This, I assume, is due to the fact that they have a custom style.

Hard to say without an example. Can you post a sample docx?
* Re: Custom styles in docx to markdown conversion.
From: Joost Kremers @ 2021-12-10 19:39 UTC
To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw
Cc: John MacFarlane

On Fri, Dec 10 2021, John MacFarlane wrote:
> "Joost Kremers" <joostkremers-97jfqw80gc6171pxa8y+qA@public.gmane.org> writes:
>> I'm trying to convert some MS Word documents to markdown, which works
>> surprisingly well, except for two issues I'm wondering about:
>>
>> 1. Section headings aren't converted to `## ...`. This, I assume, is due to
>> the fact that they have a custom style.
>
> Hard to say without an example. Can you post a sample docx?

No, unfortunately not, it's confidential material. When I convert with `-f docx+styles` I get things like:

```
::: {custom-style="XYZ Minor Head (1.1)"}
Lorem ipsum
:::

::: {custom-style="XYZ Body Text"}
> Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod
> tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,
> quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo
> consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse
> cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non
> proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
:::
```

Etc., etc.

Does that help?

Thanks,

Joost

--
Joost Kremers
Life has its moments
* Re: Custom styles in docx to markdown conversion.
From: John MacFarlane @ 2021-12-11 0:07 UTC
To: Joost Kremers, pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

> Does that help?

Yeah, that's enough information for me.

What you need to do is to write a Lua filter like this:

    function Div(el)
      if el.attributes['custom-style']:match('XYZ Minor Head') then
        return pandoc.Header(2, pandoc.utils.blocks_to_inlines(el.content))
      end
    end

Hope it's clear what this does.
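A slightly more defensive variant of the filter above might look like the following. This is only a sketch under two assumptions that the thread does not confirm: that some Divs in the document may carry no `custom-style` attribute at all (in which case calling `:match` on `nil` would raise an error), and that style names may contain characters such as `(` that Lua patterns treat specially. The helper name `has_style` and the file name `styles.lua` are made up for illustration.

```lua
-- styles.lua (hypothetical file name)

-- Return true if the element's custom-style attribute contains `name`.
-- string.find with plain = true does a literal substring search, so
-- parentheses or dots in a style name are not read as pattern syntax.
local function has_style(el, name)
  local style = el.attributes['custom-style']
  -- Elements without the attribute yield nil; treat that as "no match".
  return style ~= nil and style:find(name, 1, true) ~= nil
end

function Div(el)
  if has_style(el, 'XYZ Minor Head') then
    -- Flatten the Div's blocks into inlines and wrap them in a level-2 header.
    return pandoc.Header(2, pandoc.utils.blocks_to_inlines(el.content))
  end
  -- Returning nothing leaves the Div unchanged.
end
```

It would be run with something like `pandoc -f docx+styles -t markdown --lua-filter styles.lua input.docx -o output.md`, where the file names are placeholders.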
* Re: Custom styles in docx to markdown conversion.
From: Joost Kremers @ 2021-12-15 9:47 UTC
To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw
Cc: John MacFarlane

On Fri, Dec 10 2021, John MacFarlane wrote:
>> Does that help?
>
> Yeah, that's enough information for me.
>
> What you need to do is to write a Lua filter like this:
>
> function Div(el)
>   if el.attributes['custom-style']:match('XYZ Minor Head') then
>     return pandoc.Header(2, pandoc.utils.blocks_to_inlines(el.content))
>   end
> end
>
> Hope it's clear what this does.

Thanks, yes, I think I can follow. I'll need to dig into creating lua filters, but this should get me started.

--
Joost Kremers
Life has its moments
* Re: Custom styles in docx to markdown conversion.
From: Joost Kremers @ 2021-12-16 12:28 UTC
To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw
Cc: John MacFarlane

On Fri, Dec 10 2021, John MacFarlane wrote:
>> Does that help?
>
> Yeah, that's enough information for me.
>
> What you need to do is to write a Lua filter like this:
>
> function Div(el)
>   if el.attributes['custom-style']:match('XYZ Minor Head') then
>     return pandoc.Header(2, pandoc.utils.blocks_to_inlines(el.content))
>   end
> end
>
> Hope it's clear what this does.

For some reason, it doesn't work... I tried to extend your filter to the following:

```
function Div(el)
  if el.attributes['custom-style']:match('XYZ Major Head') then
    return pandoc.Header(1, pandoc.utils.blocks_to_inlines(el.content))
  elseif el.attributes['custom-style']:match('XYZ Minor Head') then
    return pandoc.Header(2, pandoc.utils.blocks_to_inlines(el.content))
  elseif el.attributes['custom-style']:match('XYZ Body Text') then
    return pandoc.Para(pandoc.utils.blocks_to_inlines(el.content))
  end
end
```

Using this filter, the custom style 'XYZ Body Text' is converted, but the Major and Minor Heads are not. When I convert to native (without the filter), I don't see a difference between Body Text on the one hand and Major or Minor Heads on the other: both are Div elements with "custom-style" set as indicated. Only the body text is changed, the headers are not.

Could the problem be that the header Div's tend to appear inside an OrderedList? For some strange reason, the Major and Minor Heads don't use numbering. Instead, each header is an item in a numbered list... Is there a way to clean up such cases? I.e., get rid of any OrderedList that immediately contains a Major/Minor Head, but leave "normal" OrderedLists intact?

Another question: body text in the converted document is often enclosed in a Span with a specific custom-style. I'd like to get rid of the span, since the style is of no interest to me, but I'm not sure what I should have the function return. For example, the following:

```
function Span(el)
  if el.attributes['custom-style']:match('XYZ Body Text Char') then
    return pandoc.Para(pandoc.utils.blocks_to_inlines(el.content))
  end
end
```

raises an error. I also tried converting to Plain (honestly, I don't know what the correct type would be), and I tried just passing `el.content` to `pandoc.Para`, but I keep getting errors. (Specifically: "Block expected, got userdata", and also "table expected, got userdata" with Plain instead of Para.)

I apologise for what is probably a barrage of newbie questions, but having no previous knowledge of Lua and only a vague understanding of Pandoc's internal data types, I have a hard time figuring things out from the documentation.

I appreciate any pointers.

TIA

--
Joost Kremers
Life has its moments
* Re: Custom styles in docx to markdown conversion.
From: Bastien DUMONT @ 2021-12-16 14:06 UTC
To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

As for your second question, what you want is to get the content of the span, without the containing span element itself. So:

```
function Span(el)
  if el.attributes['custom-style']:match('XYZ Body Text Char') then
    return el.content
  end
end
```

The solution to "get rid of any OrderedList that immediately contains a Major/Minor Head, but leave "normal" OrderedLists intact" is similar: since the content of an OrderedList is a list of lists of Blocks, you want to return the content of the only Block in the first list of Blocks. This may work:

```
function OrderedList(el)
  local possibleHeader = el.content[1][1]
  if possibleHeader.t == 'Div' and possibleHeader.attributes['custom-style']:match('XYZ Minor Head') then
    return pandoc.Header(2, pandoc.utils.blocks_to_inlines(possibleHeader.content))
  end
end
```
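The two suggestions above could be combined and extended to cover both heading styles, roughly as follows. This is a sketch, not a tested solution: it assumes that each heading sits alone in its own single-item OrderedList (the thread only says each header is "an item in a numbered list"), and that a missing `custom-style` attribute reads as `nil`. The names `heading_levels` and `heading_level` are invented for illustration.

```lua
-- Map style-name fragments (taken from the thread) to header levels.
local heading_levels = {
  { name = 'XYZ Major Head', level = 1 },
  { name = 'XYZ Minor Head', level = 2 },
}

-- Return the header level for a heading Div, or nil for anything else.
local function heading_level(block)
  if block == nil or block.t ~= 'Div' then return nil end
  local style = block.attributes['custom-style']
  if style == nil then return nil end
  for _, h in ipairs(heading_levels) do
    -- Plain (non-pattern) substring search.
    if style:find(h.name, 1, true) then return h.level end
  end
  return nil
end

function OrderedList(el)
  -- Only touch lists made of a single item holding a single block,
  -- so that ordinary numbered lists are left intact.
  if #el.content ~= 1 or #el.content[1] ~= 1 then return nil end
  local candidate = el.content[1][1]
  local level = heading_level(candidate)
  if level then
    return pandoc.Header(level, pandoc.utils.blocks_to_inlines(candidate.content))
  end
end

function Span(el)
  local style = el.attributes['custom-style']
  if style ~= nil and style:find('XYZ Body Text Char', 1, true) then
    -- Returning the inlines replaces the Span with its own content.
    return el.content
  end
end
```

If headings turn out to share one list with other items, the single-item guard in `OrderedList` would have to be relaxed, and the function would then need to return a list of blocks rather than a single Header.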