From: Hendrik Seliger
Newsgroups: gmane.text.pandoc
Subject: Re: Retaining index entries in docx file
Date: Fri, 4 Feb 2022 05:46:40 -0800 (PST)

Hi!

I just ran into the same problem and solved it somewhat manually, but it worked well for a 200+ page document.
Basic approach: unzip the docx file (a docx is a zip archive with a different file extension), then tweak it to insert a plain-text marker for each index entry that we can later use in the file converted by Pandoc, re-zip the beast, and then use Pandoc to create a LaTeX file. With the scripts below, the converted file will contain a '++index{Index entry}' for each entry. Of course, the '++index' then needs to be find-and-replaced with '\index'. Done.

Here are the details (please excuse the Markdown, I copied it from my personal wiki). Hope this helps someone out there…


# Preserving indices
Pandoc does not do indices. So to keep them, unzip the Word file, open `document.xml` in Atom, and replace the index entries with the LaTeX command.

First, put every XML tag on its own line, then replace each index entry with a LaTeX command. I am using `++` instead of `\` to make the later replacement in the TeX file easier. _Of course, after the conversion in Pandoc the `++` needs to be manually turned into a `\` for the index commands to work._
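The two steps above can be tried on a toy fragment before touching a real document. This is only a sketch: the sample string and the function name `split_and_mark` are made up for illustration, and it assumes GNU sed (which understands `\n` in the replacement text) and index fields stored as `XE &quot;…&quot;` at the start of a text run.

```shell
# Toy fragment standing in for word/document.xml (hypothetical content).
sample='<w:r><w:t>Some text</w:t></w:r><w:r><w:instrText>XE &quot;apple&quot;</w:instrText></w:r>'

# Hypothetical helper: split tags onto their own lines, then turn each
# XE entry into a ++index marker followed by a neutralised XE "FOO" field.
# Assumes GNU sed, which accepts \n in the replacement part.
split_and_mark() {
  sed -E 's/>/>\n/g' |
    sed -E 's/(.)</\1\n</g' |
    sed -E 's/^XE &quot;(.*)&quot;/++index{\1}\nXE \&quot;FOO\&quot;/'
}

printf '%s\n' "$sample" | split_and_mark
```

Each tag, each piece of normal text, and the new `++index{apple}` marker should end up on a line of its own, which is what the later line-based processing relies on.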

Now the difficult part: any lines before and after a line starting with `++index` need to be removed, back to and up to the nearest line starting with something other than `<`. And there could be several index commands running into each other without any normal text in between.

So we pull out every index entry and make a `++index` out of it using `sed`, which is dirty but quick. Then Perl is let loose on the file to pull each `++index` in front of any XML tags, which puts it right behind the normal text, where it should end up later. Then we write the rest out as is, to make sure all XML tags Word left open are properly closed. One small alteration: any text in the Word index command is simply replaced by _FOO_, so it is easier to track if anything went wrong or these index entries somehow pop up again.

This can be achieved with the following Perl script, saved as WordIndex2LaTeX.pl (and made executable, so chmod +x WordIndex2LaTeX.pl):

```
#!/usr/bin/perl

# @Author: Hendrik G. Seliger
# @Date:   4 February 2022, 11:34 +01:00
# @Filename: WordIndex2LaTeX.pl
# @Last modified time: 4 February 2022, 11:36 +01:00
# @License: GPL3
# @Copyright: © Copyright 2022 by Hendrik G. Seliger

###
use strict;
use warnings;

my $keptlines  = '';   # buffered xml tag lines
my $indexlines = '';   # buffered ++index lines

while ( <STDIN> ) {
        if ( ( $_ =~ /^</ ) || ( $_ =~ /^XE / ) ) { # line with xml tag or word index entry
                $keptlines .= $_; # save the line
        } elsif ( $_ =~ /^\+\+index/ ) { # the '+' signs must be escaped to match a literal "++index"
                # Found an index entry. Now, put the LaTeX command BEFORE the
                # Word tag, so that the tags are correctly opened and closed, but
                # the LaTeX command appears first
                $indexlines .= $_; # save the line
        } else { # normal text line, print all kept ones and current, erase memory
                print $indexlines;
                print $keptlines;
                print $_;
                $keptlines = '';
                $indexlines = '';
        }
}
# flush whatever is still buffered at the end of the input
print $indexlines;
print $keptlines;
```
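
To see the buffering logic in isolation, the same reordering can be sketched with awk on a handful of lines. This is a hypothetical equivalent for experimenting, not the script used for the actual conversion; the function name `reorder_index` and the toy input are assumptions.

```shell
# Hypothetical awk version of the reordering: tag lines and ++index lines are
# buffered separately, and on the next normal text line the index markers are
# written out first, followed by the buffered tags.
reorder_index() {
  awk '
    /^</ || /^XE / { kept = kept $0 "\n"; next }   # tag or Word index field: buffer
    /^\+\+index/   { idx  = idx  $0 "\n"; next }   # index marker: buffer separately
    {                                              # normal text: flush everything
      printf "%s%s%s\n", idx, kept, $0
      idx = ""; kept = ""
    }
    END { printf "%s%s", idx, kept }               # flush whatever is left
  '
}

# Toy input, one item per line; joining the lines again shows the marker
# sitting right behind the preceding normal text.
printf '%s\n' '<a>' 'text' '</a>' '++index{x}' 'XE FOO' '<b>' 'more' |
  reorder_index | tr -d '\n'
```

The marker ends up directly after `text`, ahead of the closing `</a>`, while the neutralised `XE FOO` field stays between its tags.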

The conversion is then achieved with

```
mkdir D
cd D
unzip ../MyDoc.docx
cd word
cat document.xml | sed -E 's/>/>\n/g' | sed -E 's/(.)</\1\n</g' | sed -E 's/^XE &quot;(.*)&quot;/++index{\1}\nXE \&quot;FOO\&quot;/g' | ../../WordIndex2LaTeX.pl | tr -d '\n' > document2.xml
```
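
A quick sanity check at this point is to compare how many `XE` fields the original file had with how many `++index` markers the new file contains. The sketch below assumes it is run inside the `word` directory and that index fields appear as `XE &quot;` in the XML; `count_fixed` is a made-up helper name. A mismatch might indicate entries the `sed` pattern did not catch, for example fields that do not start at the beginning of a text run.

```shell
# Hypothetical helper: count non-overlapping occurrences of a fixed string.
count_fixed() {
  # grep -o prints each match on its own line; -F disables regex interpretation.
  grep -o -F -e "$1" "$2" | wc -l | tr -d ' '
}

# Only runs when both files are present in the current directory.
if [ -f document.xml ] && [ -f document2.xml ]; then
  before=$(count_fixed 'XE &quot;' document.xml)
  after=$(count_fixed '++index{' document2.xml)
  echo "XE entries: $before, ++index markers: $after"
fi
```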

Back up `document.xml` and rename `document2.xml` to `document.xml`. Re-zip the document:
```
mv document.xml ../..
mv document2.xml document.xml
cd ..
zip -r ../D.docx *
cd ..
```
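
What remains is the Pandoc run and the final find-and-replace. A minimal sketch of that last step follows; the function name `fix_index_markers` and the file names in the usage comment are placeholders. As a precaution it accepts both `++index{entry}` and `++index\{entry\}`, on the assumption that the braces may come out escaped in the LaTeX output, so it is worth checking the converted file to see which form actually appears.

```shell
# Hypothetical helper: turn ++index markers back into \index commands.
# Accepts ++index{entry} as well as ++index\{entry\} (escaped braces).
fix_index_markers() {
  sed -E 's/\+\+index\\?\{([^}\\]*)\\?\}/\\index{\1}/g'
}

# Possible usage (file names are placeholders):
#   pandoc D.docx -o D.tex
#   fix_index_markers < D.tex > D-indexed.tex
printf '%s\n' 'An apple++index{apple} a day.' | fix_index_markers
```

Entries that themselves contain braces or backslashes are not handled by this pattern and would still need a manual pass.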


John MacFarlane wrote on Monday, 30 August 2021 at 04:07:17 UTC+2:

Sorry, indexes aren't supported.

DJ Penton <jakep...@gmail.com> writes:

> I am new to pandoc. I am under enormous time pressure to convert a docx
> file to latex. This has worked beautifully except that alphabetical index
> entries in the docx file do not seem to be preserved. I would have expected
> a latex \index{} tag. Is there a way to do this?
>
> I apologise for asking a question that has probably been answered
> repeatedly. I spent 15 minutes searching for an answer and didn't see one.
> Probably I just missed it. I must continue with other work on the document
> for now.
>
> Anyway, thanks in advance; be kind :-)