public inbox archive for pandoc-discuss@googlegroups.com
* Move TOC when converting html to docx
From: Ismail Jattioui @ 2022-07-11  8:48 UTC
  To: pandoc-discuss


Hi,

I am trying to convert an HTML file to docx using pandoc. My problem is that
I can’t manage to move the table of contents to a specific position in the
document. I tried splitting my document in two and then merging the parts
again, but that isn’t optimal: we are using it in production, it costs us 2
calls to pandoc, and it isn't very maintainable.

I was wondering if there is a way to do that using Lua filters.

In a nutshell, let’s say I have the following HTML document that I wish to
convert to DOCX:

<!DOCTYPE html>
<html lang="en">
    <head>
        <meta charset="UTF-8" />
    </head>
    <h1>Title 1</h1>
    <p>Some stuff 2</p>
    <h2>Subtitle 1</h2>
    <p>Some stuff 2</p>
    <div>Other things</div>
    <div id="TOC">Insert TOC below</div>
</html>

How do I generate a table of contents below the div with the TOC id,
without splitting the document?

Thanks in advance

-- 
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/77066946-d07a-489a-9ec2-99796422f682n%40googlegroups.com.


* Re: Move TOC when converting html to docx
From: Ismail Jattioui @ 2022-07-12 14:32 UTC
  To: pandoc-discuss


I tried the code below, which looked like what I want to do, but
unfortunately it still doesn’t work.

There is apparently no RawBlock in the HTML I posted, and I don't see how
we can add one.

I tried using Para and Block, with no success. I got the following error:
PandocLuaError "Trying to set unavailable property text." at the line
indicated by ---->

The command I am using:

pandoc --metadata toc-title=custom-toc --lua-filter=filter.lua input-test.html -o res.docx

The Lua filter I am trying:

------------------------------------------------------
local RAW_TOC = [[
<w:sdt>
<w:sdtContent 
xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
<w:p>
<w:r>
<w:fldChar w:fldCharType="begin" w:dirty="true" />
<w:instrText xml:space="preserve">TOC \o "1-3" \h \z \u</w:instrText>
<w:fldChar w:fldCharType="separate" />
<w:fldChar w:fldCharType="end" />
</w:r>
</w:p>
</w:sdtContent>
</w:sdt>
]]
local meta_key = "toc-title"
local vars = {}


local function getVars (meta)
   for k, v in pairs(meta) do
      if v.t == 'MetaInlines' then
         print('isMetaInlines')
         vars["$" .. k .. "$"] = { table.unpack(v) }
      end
   end
end

local function pageBreak(el)
   if el.text == "pandoc-page-break" then
      print('pageBreak')
      return pandoc.Str ""
   else
      return el
   end
end


local function toc(el)
   print(el)
   if pandoc.utils.stringify(el) ==  "pandoc-toc" then
      ----> el.text = RAW_TOC
      el.format = "openxml"
      local para = pandoc.Para(vars)
      local div = pandoc.Div({ para, el })
      div["attr"]["attributes"]["custom-style"] = "TOC Heading"
      return div
   end
end

return {
   { Meta = getVars },
   { Str = pageBreak },
   { RawBlock = toc }
}
------------------------------------------------------

* Re: Move TOC when converting html to docx
From: Ismail Jattioui @ 2022-07-18  6:33 UTC
  To: pandoc-discuss


up please 


* Re: Move TOC when converting html to docx
From: John MacFarlane @ 2022-07-18  8:07 UTC
  To: pandoc-discuss

There's a special syntax in the docx file to include the table of contents; you're not going to be able to do it this way.

Maybe your best approach would be to have a script modify the docx after pandoc produces it. A docx is just a zip file containing XML documents, so you'd need to unzip it, modify document.xml, and zip it back up. The modification would simply consist of moving the XML elements that produce the TOC to another location in your document.xml.
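
The unzip, modify, re-zip round trip described above could be sketched in Python roughly as follows. This is only a minimal sketch under stated assumptions: `rewrite_document_xml` and its `transform` callback are hypothetical names (not part of pandoc), the main document part is assumed to be stored as `word/document.xml`, and its content is assumed to be UTF-8. Only the standard `zipfile` module is used, and every other archive member is copied unchanged.

```python
import zipfile

# Assumption: the main document part lives at this path inside the docx.
DOCUMENT_PART = "word/document.xml"


def rewrite_document_xml(src_path, dst_path, transform):
    """Copy the docx at src_path to dst_path, rewriting only document.xml.

    transform is a hypothetical callback: it receives the XML as a str and
    returns the modified XML as a str. All other members are copied as-is.
    """
    with zipfile.ZipFile(src_path) as src, \
         zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for name in src.namelist():
            data = src.read(name)
            if name == DOCUMENT_PART:
                # Assumption: document.xml is UTF-8 encoded.
                data = transform(data.decode("utf-8")).encode("utf-8")
            dst.writestr(name, data)
```

A call would look something like `rewrite_document_xml("res.docx", "res-moved.docx", my_transform)`, where `my_transform` is whatever function relocates the TOC elements. The archive is read and rewritten in memory, so nothing is extracted to disk; the source and destination paths should differ.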


* Re: Move TOC when converting html to docx
From: Ismail Jattioui @ 2022-07-21 13:48 UTC
  To: pandoc-discuss


Thank you so much, it works!

Here is a boilerplate solution for anyone else who wants to try it in
JavaScript using the JSZip library (the advantage of this library is that
you don't have to extract all the files to disk in order to process them):

https://gist.github.com/jaxalo/bd23a8db85ddc7afc5c9ca668b13c898
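
The relocation step on its own can be illustrated on the text of document.xml. The following is not the code from the gist; it is a minimal Python sketch under loud assumptions: the TOC is taken to be the first `<w:sdt>...</w:sdt>` block in the file (written without attributes and without nested `<w:sdt>` elements), the marker text (here "Insert TOC below", from the example HTML) is assumed to sit unsplit inside a single paragraph, and `move_toc_after_marker` is a hypothetical name. It uses plain string search rather than an XML parser to stay short.

```python
import re

# Assumption: the TOC is the first <w:sdt>...</w:sdt> block, with no nesting.
SDT_RE = re.compile(r"<w:sdt>.*?</w:sdt>", re.DOTALL)
PARA_END = "</w:p>"


def move_toc_after_marker(document_xml, marker_text):
    """Move the first <w:sdt> block to just after the paragraph holding marker_text.

    Returns the input unchanged if the TOC block, the marker, or the end of
    the marker's paragraph cannot be found.
    """
    toc = SDT_RE.search(document_xml)
    if toc is None:
        return document_xml
    # Cut the TOC out first so the marker is never matched inside it.
    without_toc = document_xml[:toc.start()] + document_xml[toc.end():]
    marker_at = without_toc.find(marker_text)
    if marker_at == -1:
        return document_xml
    para_end = without_toc.find(PARA_END, marker_at)
    if para_end == -1:
        return document_xml
    insert_at = para_end + len(PARA_END)
    return without_toc[:insert_at] + toc.group(0) + without_toc[insert_at:]
```

For a real document, an XML parser would be more robust than string search, since Word and pandoc may split text across several runs or nest structured document tags.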