From: Hendrik Seliger
Newsgroups: gmane.text.pandoc
Subject: Re: Retaining index entries in docx file
Date: Fri, 4 Feb 2022 05:58:38 -0800 (PST)
Message-ID: <146889bd-867e-4b83-8b21-98bb01e559d7n@googlegroups.com>
To: pandoc-discuss

BTW, I just saw that Pandoc of course also escapes the {} after index, so those need to be fixed as well. The easiest way is to push the output through sed, too:

pandoc D.docx -t latex | sed -E 's/\+\+index\\\{([^}]*)\\\}/\\index{\1}/g' >D.tex
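
A quick way to try the expression before running it on the real file (only a sketch; the sample line is made up, and `fix_index` is just a name I picked). Note that `sed -E` has no non-greedy `.*?`, so this version uses `[^}]*` instead, which assumes the index entry itself contains no `}`:

```shell
#!/bin/sh
# Turn the escaped marker '++index\{Entry\}' into '\index{Entry}'.
# [^}]* stops at the first closing brace, so two markers on one line
# are rewritten separately instead of being merged into one.
fix_index() {
  sed -E 's/\+\+index\\\{([^}]*)\\\}/\\index{\1}/g'
}

# Made-up sample line, for illustration only
printf '%s\n' 'apples ++index\{Apple\} and pears ++index\{Pear\}' | fix_index
# prints: apples \index{Apple} and pears \index{Pear}
```

With the real document this would be `pandoc D.docx -t latex | fix_index >D.tex`.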

Hendrik Seliger wrote on Friday, 4 February 2022 at 14:46:40 UTC+1:
Hi!

I just ran into the same problem and solved it somewhat manually, but it worked well for a document of 200+ pages.
Basic approach: unzip the docx file (it is a zip archive with a different filename extension), then tweak it a bit so that each index entry also appears as plain text that survives the Pandoc conversion, re-zip the beast, and then use Pandoc to create a LaTeX file. With the scripts below, the converted file will contain a '++index{Index entry}' for each entry. Of course, '++index' then needs to be find-and-replaced with '\index'. Done.

Here are the details (please excuse the Markdown; I copied it from my personal wiki). I hope this helps someone out there…


# Preserving indices
Pandoc does not do indices. So to keep them, unzip the Word file, open `document.xml` in Atom, and replace the index entries with the LaTeX command.
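
To get a feel for what the index entries look like before touching anything, they can be listed straight from the archive. This is only a sketch under two assumptions: the quotes around the entry are stored as `&quot;` (as in my file), and Word has not split the field across several runs. `list_xe_fields` is a hypothetical helper name.

```shell
#!/bin/sh
# Print every 'XE &quot;...&quot;' index field found on stdin, one per line.
list_xe_fields() {
  grep -o 'XE &quot;[^&]*&quot;'
}

# Made-up fragment, for illustration only
printf '%s\n' '<w:instrText>XE &quot;Apple&quot;</w:instrText><w:instrText>XE &quot;Pear&quot;</w:instrText>' | list_xe_fields
```

On the real file: `unzip -p MyDoc.docx word/document.xml | list_xe_fields`.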

First, put every XML tag on its own line, and replace all index entries with a LaTeX command. I am using `++` instead of `\` to make the later replacement in the TeX file easier. _Of course, after the Pandoc conversion the `++` needs to be manually turned into a `\` for the index commands to work._

Now the difficult part: any lines before and after a line starting with `++index` need to be removed, back to and up to the nearest line that starts with something other than `<`. Also, several index commands can run into each other without any normal text in between.

So we pull out each index entry and make a `++index` out of it using `sed`, which is dirty but quick. Then Perl is let loose on the file to pull the `++index` in front of any XML tags, which puts it right behind the normal text, where it should end up later. Then we write out the rest as is, to make sure all XML tags Word left open are properly closed. One small alteration: any text in the Word index command is simply replaced by _FOO_, so that it is easier to track down if anything went wrong or these index entries somehow pop up again.

This can be achieved with the following Perl script, saved as WordIndex2LaTeX.pl (and made executable with chmod +x WordIndex2LaTeX.pl):
```
#!/usr/bin/perl

# @Author: Hendrik G. Seliger
# @Date:   4 February 2022, 11:34 +01:00
# @Filename: WordIndex2LaTeX.pl
# @Last modified time: 4 February 2022, 11:36 +01:00
# @License: GPL3
# @Copyright: © Copyright 2022 by Hendrik G. Seliger

###
use strict;
use warnings;

my $keptlines  = '';   # buffered XML tags and Word XE fields
my $indexlines = '';   # buffered ++index markers

while ( <STDIN> ) {
        if ( ( $_ =~ /^</ ) || ( $_ =~ /^XE / ) ) { # line with xml tag or word index entry
                $keptlines .= $_; # save the line
        } elsif ( $_ =~ /^\+\+index/ ) {
                # Found an index entry. Now, put the LaTeX command BEFORE the
                # Word tag, so that the tags are correctly opened and closed, but
                # the LaTeX command appears first
                $indexlines .= $_; # save the line
        } else { # normal text line, print all kept ones and current, erase memory
                print $indexlines;
                print $keptlines;
                print $_;
                $keptlines  = '';
                $indexlines = '';
        }
}
# flush whatever is still buffered at the end of the input
print $indexlines;
print $keptlines;
```
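
To see the reordering on a tiny stream without the real document, here is the same buffering idea restated in awk. This is only an illustration with made-up sample lines, not a replacement for the Perl script, and `reorder_markers` is a name chosen for this sketch.

```shell
#!/bin/sh
# Hold back tag lines and ++index markers; when a normal text line arrives,
# print the markers first, then the held-back tags, then the text itself.
reorder_markers() {
  awk '
    index($0, "++index") == 1 { idx  = idx  $0 "\n"; next }   # marker line: buffer it
    /^</ || /^XE /            { kept = kept $0 "\n"; next }   # XML tag or XE field: buffer it
    { printf "%s%s%s\n", idx, kept, $0; idx = ""; kept = "" } # normal text: flush, markers first
    END { printf "%s%s", idx, kept }                          # flush what is left at the end
  '
}

# Made-up sample: one marker buried between two pieces of normal text
printf '%s\n' 'Some text' '</w:t>' '<w:instrText>' '++index{Apple}' \
  'XE &quot;FOO&quot;' '</w:instrText>' '<w:t>' 'more text' | reorder_markers
```

In the output, `++index{Apple}` comes directly after `Some text`, ahead of the closing `</w:t>`, so once the newlines are removed again the marker sits inside the preceding text run.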

The conversion is then done with:

```
mkdir D
cd D
unzip ../MyDoc.docx
cd word
cat document.xml | sed -E 's/>/>\n/g' | sed -E 's/(.)</\1\n</g' | sed -E 's/^XE &quot;(.*)&quot;/++index{\1}\nXE \&quot;FOO\&quot;/g' | ../../WordIndex2LaTeX.pl | tr -d '\n' >document2.xml
```
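
The three `sed` stages can be tried on a small fragment first. A sketch, assuming GNU sed (it relies on `\n` in the replacement being turned into a newline) and a made-up input line; `split_tags` and `mark_index` are names invented here.

```shell
#!/bin/sh
# Stages 1 and 2: put every XML tag on its own line.
split_tags() {
  sed -E 's/>/>\n/g' | sed -E 's/(.)</\1\n</g'
}

# Stage 3: copy each XE field into a ++index{...} marker line and
# replace the text of the original field with FOO.
mark_index() {
  sed -E 's/^XE &quot;(.*)&quot;/++index{\1}\nXE \&quot;FOO\&quot;/g'
}

# Made-up fragment, for illustration only
printf '%s\n' '<w:t>Some text</w:t><w:instrText>XE &quot;Apple&quot;</w:instrText>' \
  | split_tags | mark_index
```

The result has one tag or text fragment per line, with a `++index{Apple}` line directly above the `XE &quot;FOO&quot;` line. That is the form the Perl script expects on its input.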

Back up `document.xml` and rename `document2.xml` to `document.xml`. Re-zip the document:
```
mv document.xml ../..
mv document2.xml document.xml
cd ..
zip -r ../D.docx *
cd ..
```
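
Before running Pandoc it can be worth checking that the markers really ended up in the re-zipped file. A small sketch (`count_markers` is a hypothetical helper; the sample line is made up):

```shell
#!/bin/sh
# Count the ++index{...} markers in an XML stream read from stdin.
count_markers() {
  grep -o '[+][+]index{[^}]*}' | wc -l | tr -d ' '
}

# Made-up fragment with two markers, for illustration only
printf '%s\n' '<w:t>Some text++index{Apple}</w:t><w:t>more++index{Pear}</w:t>' | count_markers
# prints: 2
```

On the real file: `unzip -p D.docx word/document.xml | count_markers`. The number should match the number of index entries in the Word document.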


John MacFarlane wrote on Monday, 30 August 2021 at 04:07:17 UTC+2:

Sorry, indexes aren't supported.

DJ Penton <jakep...-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> writes:

> I am new to pandoc. I am under enormous time pressure to convert a docx
> file to latex. This has worked beautifully except that alphabetical index
> entries in the docx file do not seem to be preserved. I would have expected
> a latex \index{} tag. Is there a way to do this?
>
> I apologise for asking a question that has probably been answered
> repeatedly. I spent 15 minutes searching for an answer and didn't see one.
> Probably I just missed it. I must continue with other work on the document
> for now.
>
> Anyway, thanks in advance; be kind :-)
>
> --
> You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discus...-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org.
> To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/3da0bbf7-36e8-4511-9766-7777ec427133n%40googlegroups.com.
