public inbox archive for pandoc-discuss@googlegroups.com
From: Hendrik Seliger <hgseliger-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
To: pandoc-discuss <pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
Subject: Re: Retaining index entries in docx file
Date: Fri, 4 Feb 2022 05:46:40 -0800 (PST)
Message-ID: <360a19b9-7a61-4452-8a57-e72cd68f05a1n@googlegroups.com>
In-Reply-To: <m2ilznbs8p.fsf-jF64zX8BO0+FqBokazbCQ6OPv3vYUT2dxr7GGTnW70NeoWH0uzbU5w@public.gmane.org>


Hi!

I just ran into the same problem and solved it somewhat manually, but it
worked well for a 200+ page document.
Basic approach: unzip the docx file (these are zip archives with a
different file extension), tweak it so that each index entry leaves a text
marker that survives the conversion, re-zip the beast, and then use Pandoc
to create a LaTeX file. With the scripts below, the converted file will
contain a '++index{Index entry}' for each entry. Of course, the '++index'
then needs to be find-and-replaced with '\index'.
Done.

Here are the details (please excuse the Markdown, I copied it from my
personal wiki). I hope this helps someone out there…


# Preserving indices
Pandoc does not convert index entries. So to keep them, unzip the Word
file, open `document.xml` in Atom, and replace the index entries with the
LaTeX command.

First, put every XML tag on a line of its own. Then replace each index
entry with a LaTeX command. I am using `++` instead of `\` to make the
later replacement in the TeX file easier. _Of course, after the conversion
in Pandoc, the `++` needs to be manually turned into a `\` for the index
commands to work._
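
To illustrate these two steps on a small hand-made fragment, here is a minimal sketch. The sample string and the function names `split_tags` and `mark_index_entries` are made up for this illustration, the `\n` in the replacement assumes GNU sed, and the pattern assumes the entry shows up as `XE &quot;...&quot;` at the start of a line:

```shell
# Put every XML tag on its own line.
# Assumption: GNU sed, where \n in the replacement inserts a newline.
split_tags() {
  sed -E 's/>/>\n/g' | sed -E 's/(.)</\1\n</g'
}

# Turn a line of the form  XE &quot;Entry&quot;  into a ++index marker and
# leave a placeholder field (FOO) behind in place of the original entry.
mark_index_entries() {
  sed -E 's/^XE &quot;(.*)&quot;/++index{\1}\nXE \&quot;FOO\&quot;/'
}

# Hand-made sample, not taken from a real Word file.
sample='<w:t>Some text</w:t><w:instrText>XE &quot;Apple&quot;</w:instrText>'
printf '%s\n' "$sample" | split_tags | mark_index_entries
```

After this, the `++index{Apple}` marker sits on its own line between the tag lines, which is the situation the next step has to untangle.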

Now the difficult part: any lines before and after a line starting with
`++index` need to be removed, back to and up to the nearest line starting
with something other than `<`. In addition, several index commands can run
into each other without any normal text in between.

So we pull out every index entry and make a `++index` out of it using
`sed`, which is dirty, but quick. Then Perl is run over the file to move
each `++index` in front of any XML tags, which puts it right behind the
normal text, where it should end up. Then we write the rest out as is, to
make sure all XML tags Word kept open are properly closed. One small
alteration: the text in the Word index command is simply replaced by _FOO_,
so it is easier to track down if anything went wrong or if these index
entries somehow pop up again.

This can be achieved with the following Perl script saved as
WordIndex2LaTeX.pl (and made executable with `chmod +x WordIndex2LaTeX.pl`):
```
#!/usr/bin/perl

# @Author: Hendrik G. Seliger
# @Date:   4 February 2022, 11:34 +01:00
# @Filename: WordIndex2LaTeX.pl
# @Last modified time: 4 February 2022, 11:36 +01:00
# @License: GPL3
# @Copyright: © Copyright 2022 by Hendrik G. Seliger

###
use strict;
use warnings;

my $keptlines  = '';   # buffered XML tag lines and Word index fields
my $indexlines = '';   # buffered ++index markers

while ( <STDIN> ) {
        if ( ( $_ =~ /^</ ) || ( $_ =~ /^XE / ) ) {
                # Line with an XML tag or a Word index entry: save it
                $keptlines .= $_;
        } elsif ( $_ =~ /^\+\+index/ ) {
                # Found an index entry. Put the LaTeX command BEFORE the
                # Word tags, so that the tags are correctly opened and
                # closed, but the LaTeX command appears first.
                # The + signs must be escaped to match literally.
                $indexlines .= $_;
        } else {
                # Normal text line: print all kept lines and the current
                # one, then erase the memory
                print $indexlines;
                print $keptlines;
                print $_;
                $keptlines  = '';
                $indexlines = '';
        }
}
# Flush whatever is still buffered at the end of the input
print $indexlines;
print $keptlines;
```
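
To see the effect of this buffering without creating a Word file, the same logic can be sketched in awk and fed a few hand-written lines. This is only an illustration: `reorder_index` is a made-up name, the sample lines are invented, and the input is assumed to be already split into one tag per line.

```shell
# Same buffering idea as the Perl script, written in awk for illustration:
#  - tag lines and Word index fields are held back,
#  - ++index markers are held back in a separate buffer,
#  - a normal text line flushes the markers first, then the held-back tags.
reorder_index() {
  awk '
    /^</ || /^XE / { kept = kept $0 "\n"; next }
    /^\+\+index/   { idx  = idx  $0 "\n"; next }
    { printf "%s%s%s\n", idx, kept, $0; idx = ""; kept = "" }
    END { printf "%s%s", idx, kept }
  '
}

# Hand-made sample: one tag or text chunk per line.
printf '%s\n' '<a>' 'text' '</a>' '<i>' '++index{Apple}' 'XE FOO' '</i>' \
              '<b>' 'more' '</b>' | reorder_index | tr -d '\n'
echo
```

In the joined output the marker ends up directly behind `text`, before the closing tag of that run, while the placeholder field stays inside its own tags.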

The conversion is then achieved with

```
mkdir D
cd D
unzip ../MyDoc.docx
cd word
sed -E 's/>/>\n/g' document.xml \
  | sed -E 's/(.)</\1\n</g' \
  | sed -E 's/^XE &quot;(.*)&quot;/++index{\1}\nXE \&quot;FOO\&quot;/g' \
  | ../../WordIndex2LaTeX.pl \
  | tr -d '\n' > document2.xml
```
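
Before re-zipping, a quick sanity check could compare the number of Word index fields in the original file with the number of markers in the result. A minimal sketch: `count_matches` and `check_index_counts` are hypothetical helpers, and the check assumes the fields appear as `XE &quot;` in `document.xml`, as the `sed` pattern above does.

```shell
# Count occurrences of a fixed string in a file (hypothetical helper).
count_matches() {
  grep -oF -- "$1" "$2" | wc -l | tr -d ' '
}

# Compare index fields in the original with ++index markers in the result.
# Succeeds (exit status 0) only if both counts are equal.
check_index_counts() {
  before=$(count_matches 'XE &quot;' "$1")
  after=$(count_matches '++index{' "$2")
  echo "Word index fields: $before, ++index markers: $after"
  [ "$before" -eq "$after" ]
}

# Usage, run inside the word/ directory:
#   check_index_counts document.xml document2.xml
```

A mismatch would hint at entries the `sed` pattern did not catch, for example fields that do not start exactly with `XE &quot;`.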

Back up `document.xml` and rename `document2.xml` to `document.xml`. Then
re-zip the document:
```
mv document.xml ../..
mv document2.xml document.xml
cd ..
zip -r ../D.docx *
cd ..
```
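
The remaining steps mentioned at the top (running Pandoc on the re-zipped file and turning `++index` into `\index`) might look like the sketch below. `fix_index_markers` is a made-up name. Depending on how the LaTeX writer escapes braces, the marker may come out as `++index\{...\}` rather than `++index{...}`; the pattern accepts both, which is an assumption to verify against your own output. Entries that themselves contain braces or backslashes are not handled.

```shell
# Replace ++index{...} (or ++index\{...\}, in case the braces were escaped)
# with \index{...}. Reads LaTeX on stdin and writes the result to stdout.
fix_index_markers() {
  sed -E 's/\+\+index\\?\{([^{}\\]*)\\?\}/\\index{\1}/g'
}

# Usage (assumes pandoc is installed and D.docx is the re-zipped file):
#   pandoc -s D.docx -t latex | fix_index_markers > MyDoc.tex
```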


John MacFarlane wrote on Monday, 30 August 2021 at 04:07:17 UTC+2:

>
> Sorry, indexes aren't supported.
>
> DJ Penton <jakep...-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> writes:
>
> > I am new to pandoc. I am under enormous time pressure to convert a docx
> > file to latex. This has worked beautifully except that alphabetical index
> > entries in the docx file do not seem to be preserved. I would have expected
> > a latex \index{} tag. Is there a way to do this?
> >
> > I apologise for asking a question that has probably been answered
> > repeatedly. I spent 15 minutes searching for an answer and didn't see one.
> > Probably I just missed it. I must continue with other work on the document
> > for now.
> >
> > Anyway, thanks in advance; be kind :-)
> >
> > -- 
> > You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
> > To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discus...-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
> > To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/3da0bbf7-36e8-4511-9766-7777ec427133n%40googlegroups.com.
>



