public inbox archive for pandoc-discuss@googlegroups.com
* Custom styles in docx to markdown conversion.
From: Joost Kremers @ 2021-12-10 12:07 UTC
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

Hi list,

I'm trying to convert some MS Word documents to markdown, which works surprisingly well, except for two issues I'm wondering about:

1. Section headings aren't converted to `## ...`. This, I assume, is due to the fact that they have a custom style.

2. Most of the text is rendered as block quotes. This seems to be caused by the fact that the body text has a custom style as well, since there are a few paragraphs that are not rendered as block quotes and they don't have any style applied to them in the docx file.

So, is there a way to make Pandoc understand these styles and convert them correctly?

TIA

Joost



-- 
Joost Kremers
Life has its moments

-- 
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/a1fcb30f-4d0d-449b-b02e-b375f8e38abe%40www.fastmail.com.



* Re: Custom styles in docx to markdown conversion.
From: John MacFarlane @ 2021-12-10 16:56 UTC
  To: Joost Kremers, pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

"Joost Kremers" <joostkremers-97jfqw80gc6171pxa8y+qA@public.gmane.org> writes:

> Hi list,
>
> I'm trying to convert some MS Word documents to markdown, which works surprisingly well, except for two issues I'm wondering about:
>
> 1. Section headings aren't converted to `## ...`. This, I assume, is due to the fact that they have a custom style.

Hard to say without an example. Can you post a sample docx?



* Re: Custom styles in docx to markdown conversion.
From: Joost Kremers @ 2021-12-10 19:39 UTC
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw; +Cc: John MacFarlane


On Fri, Dec 10 2021, John MacFarlane wrote:
> "Joost Kremers" <joostkremers-97jfqw80gc6171pxa8y+qA@public.gmane.org> writes:
>> I'm trying to convert some MS Word documents to markdown, which works
>> surprisingly well, except for two issues I'm wondering about:
>>
>> 1. Section headings aren't converted to `## ...`. This, I assume, is due to
>> the fact that they have a custom style.
>
> Hard to say without an example. Can you post a sample docx?

No, unfortunately not, it's confidential material. When I convert with
`-f docx+styles` I get things like:

```
::: {custom-style="XYZ Minor Head (1.1)"}
Lorem ipsum
:::

::: {custom-style="XYZ Body Text"}
> Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod
> tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,
> quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo
> consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse
> cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non
> proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
:::
```

Etc., etc.

Does that help?

Thanks,

Joost


-- 
Joost Kremers
Life has its moments



* Re: Custom styles in docx to markdown conversion.
From: John MacFarlane @ 2021-12-11  0:07 UTC
  To: Joost Kremers, pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw


> Does that help?

Yeah, that's enough information for me.

What you need to do is to write a Lua filter like this:

function Div(el)
  local style = el.attributes['custom-style']
  -- Divs without a custom-style attribute are left untouched.
  if style and style:match('XYZ Minor Head') then
    return pandoc.Header(2, pandoc.utils.blocks_to_inlines(el.content))
  end
end

Hope it's clear what this does.
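
If more than one style needs to become a heading, the same idea can be written as a small lookup. This is only a sketch, not a tested recipe for any particular document: the style name 'XYZ Major Head' and the levels assigned below are assumptions, and `heading_styles` and `heading_level` are hypothetical helper names. It uses a plain-text `find` (fourth argument `true`) so that characters such as `(`, `.` or `-` in a style name are not treated as Lua pattern syntax.

```lua
-- Ordered list of (style-name fragment, heading level) pairs.
-- These names and levels are assumptions; adjust them to your document.
local heading_styles = {
  { 'XYZ Major Head', 1 },
  { 'XYZ Minor Head', 2 },
}

-- Return the heading level for a custom-style value, or nil if none applies.
local function heading_level(style)
  if not style then return nil end
  for _, entry in ipairs(heading_styles) do
    -- Plain substring search: the final `true` disables pattern matching.
    if style:find(entry[1], 1, true) then
      return entry[2]
    end
  end
  return nil
end

function Div(el)
  local level = heading_level(el.attributes['custom-style'])
  if level then
    return pandoc.Header(level, pandoc.utils.blocks_to_inlines(el.content))
  end
  -- Returning nothing leaves the Div unchanged.
end
```

Saved as, say, `headings.lua` (a hypothetical file name), it would be run with something like `pandoc -f docx+styles -t markdown --lua-filter=headings.lua input.docx -o output.md`.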



* Re: Custom styles in docx to markdown conversion.
From: Joost Kremers @ 2021-12-15  9:47 UTC
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw; +Cc: John MacFarlane

On Fri, Dec 10 2021, John MacFarlane wrote:
>> Does that help?
>
> Yeah, that's enough information for me.
>
> What you need to do is to write a Lua filter like this:
>
> function Div(el)
>   if el.attributes['custom-style']:match('XYZ Minor Head') then
>     return pandoc.Header(2, pandoc.utils.blocks_to_inlines(el.content))
>   end
> end
>
> Hope it's clear what this does.

Thanks, yes, I think I can follow. I'll need to dig into creating Lua filters,
but this should get me started.

-- 
Joost Kremers
Life has its moments



* Re: Custom styles in docx to markdown conversion.
From: Joost Kremers @ 2021-12-16 12:28 UTC
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw; +Cc: John MacFarlane


On Fri, Dec 10 2021, John MacFarlane wrote:
>> Does that help?
>
> Yeah, that's enough information for me.
>
> What you need to do is to write a Lua filter like this:
>
> function Div(el)
>   if el.attributes['custom-style']:match('XYZ Minor Head') then
>     return pandoc.Header(2, pandoc.utils.blocks_to_inlines(el.content))
>   end
> end
>
> Hope it's clear what this does.

For some reason, it doesn't work... I tried to extend your filter to the
following:

```
function Div(el)
  if el.attributes['custom-style']:match('XYZ Major Head') then
    return pandoc.Header(1, pandoc.utils.blocks_to_inlines(el.content))
  elseif el.attributes['custom-style']:match('XYZ Minor Head') then
    return pandoc.Header(2, pandoc.utils.blocks_to_inlines(el.content))
  elseif el.attributes['custom-style']:match('XYZ Body Text') then
    return pandoc.Para(pandoc.utils.blocks_to_inlines(el.content))
  end
end
```

Using this filter, the custom style 'XYZ Body Text' is converted, but the Major
and Minor Heads are not. When I convert to native (without the filter), I don't
see a difference between Body Text on the one hand and Major or Minor Heads on
the other: both are Div elements with "custom-style" set as indicated. Only the
body text is changed, the headers are not.

Could the problem be that the header Divs tend to appear inside an OrderedList?
For some strange reason, the Major and Minor Heads don't use numbering. Instead,
each header is an item in a numbered list... Is there a way to clean up such
cases? I.e., get rid of any OrderedList that immediately contains a Major/Minor
Head, but leave "normal" OrderedLists intact?

Another question: body text in the converted document is often enclosed in a
Span with a specific custom-style. I'd like to get rid of the span, since the
style is of no interest to me, but I'm not sure what I should have the function
return. For example, the following:

```
function Span(el)
  if el.attributes['custom-style']:match('XYZ Body Text Char') then
    return pandoc.Para(pandoc.utils.blocks_to_inlines(el.content))
  end
end
```

raises an error. I also tried converting to Plain (honestly, I don't know what
the correct type would be), and I tried just passing `el.content` to
`pandoc.Para`, but I keep getting errors. (Specifically: "Block
expected, got userdata", and also "table expected, got userdata" with Plain
instead of Para.)

I apologise for what is probably a barrage of newbie questions, but having no
previous knowledge of Lua and only a vague understanding of Pandoc's internal
data types, I have a hard time figuring things out from the documentation.

I appreciate any pointers.

TIA

-- 
Joost Kremers
Life has its moments



* Re: Custom styles in docx to markdown conversion.
From: Bastien DUMONT @ 2021-12-16 14:06 UTC
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

As for your second question: what you want is the content of the span, without the enclosing Span element itself. So:

```
function Span(el)
  local style = el.attributes['custom-style']
  -- Spans without a custom-style attribute are left untouched.
  if style and style:match('XYZ Body Text Char') then
    return el.content
  end
end
```
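
The errors in the original attempt most likely come from the element types: a Span is an inline element, while Para and Plain are blocks, and a filter function for an inline element is expected to return an inline, a list of inlines, or nothing. A slightly more defensive sketch of the same idea follows. It assumes the style name must match exactly, and `droppable` and `should_unwrap` are hypothetical names introduced here for illustration.

```lua
-- Character styles whose Span wrapper should be dropped.
-- The single entry is the style named in this thread; add others as needed.
local droppable = {
  ['XYZ Body Text Char'] = true,
}

-- True only when a style is present and listed above (exact match).
local function should_unwrap(style)
  return style ~= nil and droppable[style] == true
end

function Span(el)
  if should_unwrap(el.attributes['custom-style']) then
    -- Returning a list of inlines splices them in place of the Span.
    return el.content
  end
  -- Returning nothing leaves every other Span unchanged.
end
```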

The solution to "get rid of any OrderedList that immediately contains a Major/Minor Head, but leave "normal" OrderedLists intact" is similar: since the content of an OrderedList is a list of lists of Blocks, you want to return the content of the first Block in the first list of Blocks (assuming it is the only one). This may work:

```
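function OrderedList(el)
  -- Guard against empty lists and empty first items.
  local firstItem = el.content[1]
  local possibleHeader = firstItem and firstItem[1]
  if possibleHeader and possibleHeader.t == 'Div' then
    local style = possibleHeader.attributes['custom-style']
    if style and style:match('XYZ Minor Head') then
      return pandoc.Header(2, pandoc.utils.blocks_to_inlines(possibleHeader.content))
    end
  end
end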
```
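
Note that the function above looks only at the first item and replaces the whole list with a single header, so a list with several header items would lose the rest. A more cautious variant, offered as an untested sketch, unwraps a list only when every item consists of exactly one heading Div and returns one Header per item. The levels assigned to the two style names are assumptions, `heading_level` is a hypothetical helper, and the sketch assumes the Divs have not already been turned into Headers by a Div function in the same filter; if they have, the check would need to look for `Header` elements instead.

```lua
-- Heading level for a custom-style value, or nil if it is not a heading
-- style. Names are the ones used in this thread; levels are assumptions.
local function heading_level(style)
  if not style then return nil end
  if style:find('XYZ Major Head', 1, true) then return 1 end
  if style:find('XYZ Minor Head', 1, true) then return 2 end
  return nil
end

function OrderedList(el)
  local headers = {}
  for _, item in ipairs(el.content) do
    local first = item[1]
    -- Each item must hold exactly one block, and it must be a heading Div.
    local level = #item == 1 and first.t == 'Div'
      and heading_level(first.attributes['custom-style'])
    if not level then
      return nil -- a "normal" list: leave it unchanged
    end
    headers[#headers + 1] =
      pandoc.Header(level, pandoc.utils.blocks_to_inlines(first.content))
  end
  -- An empty list is left alone rather than replaced by nothing.
  if #headers == 0 then return nil end
  -- Returning a list of blocks splices them in place of the OrderedList.
  return headers
end
```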



