From mboxrd@z Thu Jan  1 00:00:00 1970
X-Msuck: nntp://news.gmane.io/gmane.text.pandoc/30119
Path: news.gmane.io!.POSTED.blaine.gmane.org!not-for-mail
From: Hendrik Seliger <hgseliger-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Newsgroups: gmane.text.pandoc
Subject: Re: Retaining index entries in docx file
Date: Fri, 4 Feb 2022 05:46:40 -0800 (PST)
Message-ID: <360a19b9-7a61-4452-8a57-e72cd68f05a1n@googlegroups.com>
References: <3da0bbf7-36e8-4511-9766-7777ec427133n@googlegroups.com>
 <m2ilznbs8p.fsf@MacBook-Pro-2.hsd1.ca.comcast.net>
Reply-To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org
Mime-Version: 1.0
Content-Type: multipart/mixed; 
	boundary="----=_Part_2988_952800547.1643982400076"
Injection-Info: ciao.gmane.io; posting-host="blaine.gmane.org:116.202.254.214";
	logging-data="35711"; mail-complaints-to="usenet@ciao.gmane.io"
To: pandoc-discuss <pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
Original-X-From: pandoc-discuss+bncBDR2BA73YEOBBQO46SHQMGQETLERCEY-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org Fri Feb 04 14:46:45 2022
Return-path: <pandoc-discuss+bncBDR2BA73YEOBBQO46SHQMGQETLERCEY-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
Envelope-to: gtp-pandoc-discuss@m.gmane-mx.org
Original-Received: from mail-ot1-f61.google.com ([209.85.210.61])
	by ciao.gmane.io with esmtps (TLS1.3:ECDHE_RSA_AES_128_GCM_SHA256:128)
	(Exim 4.92)
	(envelope-from <pandoc-discuss+bncBDR2BA73YEOBBQO46SHQMGQETLERCEY-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>)
	id 1nFyvF-00093q-4M
	for gtp-pandoc-discuss@m.gmane-mx.org; Fri, 04 Feb 2022 14:46:45 +0100
Original-Received: by mail-ot1-f61.google.com with SMTP id a4-20020a9d5c84000000b005a1daff4564sf1712842oti.2
        for <gtp-pandoc-discuss@m.gmane-mx.org>; Fri, 04 Feb 2022 05:46:45 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=googlegroups.com; s=20210112;
        h=sender:date:from:to:message-id:in-reply-to:references:subject
         :mime-version:x-original-sender:reply-to:precedence:mailing-list
         :list-id:list-post:list-help:list-archive:list-subscribe
         :list-unsubscribe;
        bh=sQUmdFwxp7H5UQMpwFJQbvy5awLgUXuGlphxchuraSY=;
        b=bXuI0Z2i1OvHNIJTFWMCILo67CpFRB86W7ZXVHClzwmDNqULUjF2aIaoX9kjchMwm5
         33VXfV7zvM1Mb8wP1G0MxhkR9rJCy1m6ydZh50TqgMrx/wFTO6hryKIWyN9LZa11MqxP
         4PzuDRNZM0Ivd2skx76rIKx5yD4T+VPNOjntK0IX/p9W6PMY+/x5VZG4cjquEVod0AGk
         W7fRI84aD0WxK2AU3G8vYrkUK+O/IzKPu16nU6bPxyLVhV440yfbHMWf4NPrT1O8xxwE
         75u4xUOHUwXWTlN5tyZGFtsAYl463tRCFtwXV5IkRWxqrpUx7BIM8/ve9LQpz5E+iqCS
         q/JQ==
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20210112;
        h=date:from:to:message-id:in-reply-to:references:subject:mime-version
         :x-original-sender:reply-to:precedence:mailing-list:list-id
         :list-post:list-help:list-archive:list-subscribe:list-unsubscribe;
        bh=sQUmdFwxp7H5UQMpwFJQbvy5awLgUXuGlphxchuraSY=;
        b=bv9X7lxrU8n9MwdHHzwfVakk4KFsWj8qhuChK5uVHUcKLPwV5zPK5SVIzpdNVLsZby
         mZvb66IkTZ1XB1xd3Onv6TdWx3ncocOG6l8gHKLM8u/gWpsMMw5wo2GL0QD/CQGIkk+k
         fk+zzgZFMyu7vagbI7IJx81Jrv0KDBH7W/RhgYzmupFtmSzJFoHjAPdlGaw9t7WLDGAX
         f7qQMstDBQGXBxHkYWAz7D3gSTNkgRbiyuKkikyxs+Ee+3yyjmkIucDgUWNXeyzpbAR1
         9b8tdN8DpLRhZvQrvi/EwNM8E6nhQSilN1YyPbCIaufIenPrZdoXqqOk4vZNqxtTKlp2
         r+tg==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20210112;
        h=sender:x-gm-message-state:date:from:to:message-id:in-reply-to
         :references:subject:mime-version:x-original-sender:reply-to
         :precedence:mailing-list:list-id:x-spam-checked-in-group:list-post
         :list-help:list-archive:list-subscribe:list-unsubscribe;
        bh=sQUmdFwxp7H5UQMpwFJQbvy5awLgUXuGlphxchuraSY=;
        b=cvhhojZisnGyxTcoo113HuzhrbQTuE0iinhMqdEtY7XDuho5td7pD/n0kJRgYLxZAT
         jSOznEOXF8EafeQhTlBKf6ysHHUk2QGc5iHyiM22Ajz8N7b34X8vUxL1XqeXICkSj3E6
         RAz8eZ9Xu/V194jHxqTUVeHygGK4Yu9S1f5eq+vZdZ+yGRp8S0WSAhHVDBja4ZEP8Kfl
         I3prinYMF7+X6tXLapoVx5kNHhTEFgW3LmSSdvbDbQqfzLlozyCao2Z5uSGSQlcVXc+a
         A+jPisWN5DGB/9cYSmTgC2yU4lRuq2GjUVLRLjNkuIY4Q/6b2Og1G/EygdWd2cIo3SoO
         ZJBw==
Original-Sender: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org
X-Gm-Message-State: AOAM531KED6t6KKvL1tVJ3pjMZ7sZ46OyfmbejMnx+YCWlzoPlOWlSNS
	sJZVytW2PKSpTw0OrGJBsKo=
X-Google-Smtp-Source: ABdhPJyKx0N1WiCWQ+DalAT0N40BrpCw/z7CAnA8gq2G74Zs5dNzDI0Mkewi1nnG3Y02lzK2LUS3XQ==
X-Received: by 2002:a05:6870:d502:: with SMTP id b2mr637840oan.280.1643982404050;
        Fri, 04 Feb 2022 05:46:44 -0800 (PST)
X-BeenThere: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org
Original-Received: by 2002:a05:6870:ac06:: with SMTP id kw6ls293513oab.4.gmail; Fri, 04
 Feb 2022 05:46:40 -0800 (PST)
X-Received: by 2002:a05:6870:7341:: with SMTP id r1mr398346oal.222.1643982400663;
        Fri, 04 Feb 2022 05:46:40 -0800 (PST)
In-Reply-To: <m2ilznbs8p.fsf-jF64zX8BO0+FqBokazbCQ6OPv3vYUT2dxr7GGTnW70NeoWH0uzbU5w@public.gmane.org>
X-Original-Sender: HGSeliger-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org
Precedence: list
Mailing-list: list pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org; contact pandoc-discuss+owners-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org
List-ID: <pandoc-discuss.googlegroups.com>
X-Google-Group-Id: 1007024079513
List-Post: <https://groups.google.com/group/pandoc-discuss/post>, <mailto:pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
List-Help: <https://groups.google.com/support/>, <mailto:pandoc-discuss+help-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
List-Archive: <https://groups.google.com/group/pandoc-discuss>
List-Subscribe: <https://groups.google.com/group/pandoc-discuss/subscribe>, <mailto:pandoc-discuss+subscribe-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
List-Unsubscribe: <mailto:googlegroups-manage+1007024079513+unsubscribe-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>,
 <https://groups.google.com/group/pandoc-discuss/subscribe>
Xref: news.gmane.io gmane.text.pandoc:30119
Archived-At: <http://permalink.gmane.org/gmane.text.pandoc/30119>

------=_Part_2988_952800547.1643982400076
Content-Type: multipart/alternative; 
	boundary="----=_Part_2989_7782225.1643982400076"

------=_Part_2989_7782225.1643982400076
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Hi!

I just ran into the same problem and solved it somewhat manually, but it
worked well for a document of 200+ pages.
Basic approach: unzip the docx file (these are zip archives with a
different filename extension), then tweak it a bit to insert a plain-text
marker for each index entry that we can later use in the file converted
by Pandoc, re-zip the beast and then use Pandoc to create a LaTeX file.
With the scripts below, the converted file has a '++index{Index entry}'
for each entry. Of course, the '++index' then needs to be
find-and-replaced with '\index'.
Done.

Here are the details (please excuse the Markdown, I copied it from my
personal wiki). Hope this helps someone or other out there=E2=80=A6


# Preserving indices
Pandoc does not do indices. So to keep them, unzip the Word file, open=20
`document.xml` in Atom, and replace the index entries with the LaTeX=20
command.

First put every XML tag on a line of its own. Then replace all index
entries with a LaTeX command. I am using `++` instead of the `\` to make
the later replacement in the TeX file easier. _Of course, after the
conversion in Pandoc the `++` needs to be turned into a `\` manually for
the index commands to work._
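
That last replacement can also be scripted. A minimal sketch, assuming
the marker shows up in the TeX file exactly as `++index{`; if Pandoc
escaped the braces in your output, the pattern would need adjusting. The
`fix_index` name is only for illustration:

```shell
# Hypothetical helper: turn the ++index marker into \index.
# Reads LaTeX on stdin and writes the result to stdout.
fix_index() {
    sed 's/++index{/\\index{/g'
}

# Small demonstration on a made-up sentence.
printf '%s\n' 'Some word++index{Word} here.' | fix_index
```

For a real file this would be used as a filter, for example
`fix_index <D.tex >D-indexed.tex` (both file names are assumptions).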

Now the difficult part: any lines before and after a line starting with
`++index` need to be removed, from and until a line starting with
something other than `<`. And there could be several index commands
running into each other without any normal text in between.

So we pull out every index entry and make a `++index` out of it using
`sed`, which is dirty but quick. Then Perl is let loose on the file to
move each `++index` in front of any XML tags, which puts it right behind
the normal text, where it should end up later. The rest is written out
as is, to make sure all XML tags Word left open are properly closed. One
small alteration: the text in the Word index command is simply replaced
by _FOO_, so it is easier to track down if anything went wrong or these
indices somehow pop up again.

This can be achieved with the following perl-script saved in=20
WordIndex2LaTeX.pl (and made executable, so chmod +x WordIndex2LaTeX.pl):
```
#!/usr/bin/perl

# @Author: Hendrik G. Seliger
# @Date:   4 February 2022, 11:34 +01:00
# @Filename: WordIndex2LaTeX.pl
# @Last modified time: 4 February 2022, 11:36 +01:00
# @License: GPL3
# @Copyright: =C2=A9 Copyright 2022 by Hendrik G. Seliger

###
use strict;
use warnings;

my $keptlines =3D '';   # held-back XML tag and Word XE lines
my $indexlines =3D '';  # held-back ++index lines

while ( my $line =3D <STDIN> ) {
    if ( $line =3D~ /^</ || $line =3D~ /^XE / ) {
        # Line with an XML tag or a Word index entry: hold it back
        $keptlines .=3D $line;
    } elsif ( $line =3D~ /^\+\+index/ ) {
        # Found an index entry. It is printed BEFORE the held-back
        # Word tags, so that the tags are still correctly opened and
        # closed, but the LaTeX command appears first
        $indexlines .=3D $line;
    } else {
        # Normal text line: print everything held back, then the
        # current line, and reset both buffers
        print $indexlines, $keptlines, $line;
        $keptlines =3D '';
        $indexlines =3D '';
    }
}
# Flush whatever is still held back at the end of the input
print $indexlines, $keptlines;
```

The conversion is then achieved with

```
mkdir D
cd D
unzip ../MyDoc.docx
cd word
sed -E 's/>/>\n/g' document.xml | sed -E 's/(.)</\1\n</g' |
  sed -E 's/^XE &quot;(.*)&quot;/++index{\1}\nXE \&quot;FOO\&quot;/g' |
  ../../WordIndex2LaTeX.pl | tr -d '\n' >document2.xml
```
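
To see what the three `sed` stages do before the Perl script runs, here
is a small sketch on a made-up fragment. The `sample` and
`split_and_mark` names are only for illustration, the fragment is not
real Word output, and GNU sed is assumed for the `\n` in the replacement
(as in the command above):

```shell
# Made-up fragment shaped like a Word index field (an assumption).
sample() {
    printf '%s' '<w:t>Text</w:t><w:instrText>XE &quot;Entry&quot;'
    printf '%s\n' '</w:instrText><w:t>More</w:t>'
}

# The same three sed stages as in the pipeline above.
split_and_mark() {
    sed -E 's/>/>\n/g' |
    sed -E 's/(.)</\1\n</g' |
    sed -E 's/^XE &quot;(.*)&quot;/++index{\1}\nXE \&quot;FOO\&quot;/g'
}

sample | split_and_mark
```

Every tag and every piece of text lands on a line of its own, and the
`++index{Entry}` line ends up directly above the `XE` line, whose text
is now FOO. The Perl script then moves the marker in front of the tag
lines that precede it.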

Back up `document.xml` and rename `document2.xml` to `document.xml`.
Re-zip the document:
```
mv document.xml ../..
mv document2.xml document.xml
cd ..
zip -r ../D.docx *
cd ..
```
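
What remains is the Pandoc run on the re-zipped file. A minimal sketch
of a sanity check, assuming you want to confirm that no marker got lost
on the way: count the `++index` markers in the patched `document.xml`
and again in the TeX file. The `count_markers` name and the file names
in the comments are only for illustration:

```shell
# Hypothetical helper: count the ++index markers in a file.
count_markers() {
    grep -o -F '++index' "$1" | wc -l | tr -d ' '
}

# Possible use after the steps above (file names are assumptions):
#   pandoc D.docx -o D.tex
#   count_markers D/word/document.xml
#   count_markers D.tex
```

If the two counts differ, some index entries did not survive the
conversion and are worth checking by hand.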


John MacFarlane wrote on Monday, 30 August 2021 at 04:07:17 UTC+2:

>
> Sorry, indexes aren't supported.
>
> DJ Penton <jakep...-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> writes:
>
> > I am new to pandoc. I am under enormous time pressure to convert a
> > docx file to latex. This has worked beautifully except that
> > alphabetical index entries in the docx file do not seem to be
> > preserved. I would have expected a latex \index{} tag. Is there a
> > way to do this?
> >
> > I apologise for asking a question that has probably been answered
> > repeatedly. I spent 15 minutes searching for an answer and didn't
> > see one. Probably I just missed it. I must continue with other work
> > on the document for now.
> >
> > Anyway, thanks in advance; be kind :-)

--=20
You received this message because you are subscribed to the Google Groups "=
pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an e=
mail to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/=
pandoc-discuss/360a19b9-7a61-4452-8a57-e72cd68f05a1n%40googlegroups.com.

------=_Part_2989_7782225.1643982400076--

------=_Part_2988_952800547.1643982400076--