Re: Retaining index entries in docx file

public inbox archive for pandoc-discuss@googlegroups.com

From: Hendrik Seliger <hgseliger-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
To: pandoc-discuss <pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
Subject: Re: Retaining index entries in docx file
Date: Fri, 4 Feb 2022 05:58:38 -0800 (PST)
Message-ID: <146889bd-867e-4b83-8b21-98bb01e559d7n@googlegroups.com>
In-Reply-To: <360a19b9-7a61-4452-8a57-e72cd68f05a1n-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>



BTW, I just saw that Pandoc of course also escapes the {} after index, so 
those need to be fixed as well. The easiest way is to push the output through 
sed, too:

pandoc D.docx -t latex | sed -E 's/\+\+index\\\{([^}]*)\\\}/\\index{\1}/g' > D.tex
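
A quick way to check the substitution before running it on the whole file is to feed it one made-up line. This is only a sketch: the sample line and the function name `fix_index` are invented for illustration, and it uses `[^}]*` rather than `.*?`, since sed's extended regular expressions have no non-greedy quantifier and `.*` would swallow everything up to the last `\}` on a line with several entries.

```shell
# Hypothetical helper: turn every "++index\{...\}" into "\index{...}".
fix_index() {
    sed -E 's/\+\+index\\\{([^}]*)\\\}/\\index{\1}/g'
}

# Made-up sample of what Pandoc might emit: the braces arrive escaped.
line='Some text++index\{Apples\} and more++index\{Pears\} text.'

printf '%s\n' "$line" | fix_index
# prints: Some text\index{Apples} and more\index{Pears} text.
```

Both entries on the line are converted separately, which is the case where a greedy `.*` would go wrong.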

Hendrik Seliger wrote on Friday, 4 February 2022 at 14:46:40 UTC+1:

> Hi!
>
> I just ran into the same problem and solved it somewhat manually, but it 
> worked well for a 200+ page document.
> Basic approach: unzip the docx file (it is a zip archive with a 
> different file extension), tweak it so that each index entry becomes a 
> plain-text marker that survives the conversion, re-zip it, and then use 
> Pandoc to create a LaTeX file. With the scripts below, the converted file 
> contains a '++index{Index entry}' for each entry. The '++index' then needs 
> to be find-and-replaced with '\index'. Done.
>
> Here are the details (please excuse the Markdown, I copied them from my 
> personal wiki). I hope this helps someone out there.
>
>
> # Preserving indices
> Pandoc does not do indices. So to keep them, unzip the Word file, open 
> `document.xml` in Atom, and replace the index entries with the LaTeX 
> command.
>
> First, put every XML tag on its own line, then replace each index entry 
> with a LaTeX command. I am using `++` instead of the `\` to make the later 
> replacement in the TeX file easier. _Of course, after the conversion in 
> Pandoc the `++` needs to be manually turned into a `\` for the 
> index commands to work._
>
> Now the difficult part: any lines before and after a line starting with 
> `++index` need to be removed, from and until a line starting with something 
> other than `<`. Also, there could be several index commands running into 
> each other without any normal text in between.
>
> So we pull out any index entry and make a `++index` out of it using `sed`, 
> which is dirty, but quick. Then perl is dropped onto the file to pull the 
> `++index` before any xml-tags, which puts it right behind the normal text, 
> where they should go later. Then we write the rest out as is, to make sure 
> all xml-tags Word kept open are properly closed. One small alteration: any 
> text in the Word index command is simply replaced by _FOO_, so it would be 
> easier to track if anything went wrong or these indices somehow pop up 
> again.
>
> This can be achieved with the following perl-script saved in 
> WordIndex2LaTeX.pl (and made executable, so chmod +x WordIndex2LaTeX.pl):
> ```
> #!/usr/bin/perl
>
> # @Author: Hendrik G. Seliger
> # @Date:   4 February 2022, 11:34 +01:00
> # @Filename: WordIndex2LaTeX.pl
> # @Last modified time: 4 February 2022, 11:36 +01:00
> # @License: GPL3
> # @Copyright: © Copyright 2022 by Hendrik G. Seliger
>
> ###
> use strict;
> use warnings;
>
> my $keptlines  = '';  # buffered XML-tag and Word index-field lines
> my $indexlines = '';  # buffered ++index lines
>
> while ( my $line = <STDIN> ) {
>         if ( $line =~ /^</ || $line =~ /^XE / ) {
>                 # Line with an XML tag or a Word index entry: hold it back.
>                 $keptlines .= $line;
>         } elsif ( $line =~ /^\+\+index/ ) {
>                 # Found an index entry. Put the LaTeX command BEFORE the
>                 # Word tags, so that the tags are correctly opened and
>                 # closed, but the LaTeX command appears first.
>                 $indexlines .= $line;
>         } else {
>                 # Normal text line: print the buffered index commands, the
>                 # buffered tags, and the current line, then reset the buffers.
>                 print $indexlines;
>                 print $keptlines;
>                 print $line;
>                 $keptlines  = '';
>                 $indexlines = '';
>         }
> }
> # Flush whatever is still buffered at the end of the input.
> print $indexlines;
> print $keptlines;
> ```
>
> The conversion is then achieved with:
>
> ```
> mkdir D
> cd D
> unzip ../MyDoc.docx
> cd word
> cat document.xml |
>   sed -E 's/>/>\n/g' |
>   sed -E 's/(.)</\1\n</g' |
>   sed -E 's/^XE &quot;(.*)&quot;/++index{\1}\nXE \&quot;FOO\&quot;/g' |
>   ../../WordIndex2LaTeX.pl |
>   tr -d '\n' > document2.xml
> ```
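
To see what the three sed stages produce before the Perl script reorders anything, they can be run on a tiny fragment. This is a sketch under assumptions: the fragment is a made-up, heavily simplified stand-in for what `document.xml` contains, the function name `split_and_mark` is invented, and GNU sed is assumed (it is what interprets `\n` in the replacement as a newline).

```shell
# Hypothetical helper wrapping the three sed stages from the pipeline above.
split_and_mark() {
    sed -E 's/>/>\n/g' |       # break the line after every closing '>'
    sed -E 's/(.)</\1\n</g' |  # break the line before every '<' that follows text
    sed -E 's/^XE &quot;(.*)&quot;/++index{\1}\nXE \&quot;FOO\&quot;/g'
}

# Made-up, simplified fragment containing one Word index field.
fragment='<w:r><w:instrText>XE &quot;Apples&quot;</w:instrText></w:r>'

printf '%s\n' "$fragment" | split_and_mark
```

Each tag ends up on its own line, and the field text is replaced by a `++index{Apples}` line followed by the neutralised `XE &quot;FOO&quot;` line, which is the input shape the Perl script expects.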
>
> Back up `document.xml` and rename `document2.xml` to `document.xml`. 
> Then re-zip the document:
> ```
> mv document.xml ../..
> mv document2.xml document.xml
> cd ..
> zip -r ../D.docx *
> cd ..
> ```
>
>
> John MacFarlane wrote on Monday, 30 August 2021 at 04:07:17 UTC+2:
>
>>
>> Sorry, indexes aren't supported. 
>>
>> DJ Penton <jakep...-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> writes: 
>>
>> > I am new to pandoc. I am under enormous time pressure to convert a docx 
>> > file to latex. This has worked beautifully except that alphabetical index 
>> > entries in the docx file do not seem to be preserved. I would have expected 
>> > a latex \index{} tag. Is there a way to do this? 
>> > 
>> > I apologise for asking a question that has probably been answered 
>> > repeatedly. I spent 15 minutes searching for an answer and didn't see one. 
>> > Probably I just missed it. I must continue with other work on the document 
>> > for now. 
>> > 
>> > Anyway, thanks in advance; be kind :-) 
>> > 
>> > -- 
>> > You received this message because you are subscribed to the Google Groups "pandoc-discuss" group. 
>> > To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discus...-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org 
>> > To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/3da0bbf7-36e8-4511-9766-7777ec427133n%40googlegroups.com. 
>>
>>
>

-- 
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/146889bd-867e-4b83-8b21-98bb01e559d7n%40googlegroups.com.


