Retaining index entries in docx file

public inbox archive for pandoc-discuss@googlegroups.com

* Retaining index entries in docx file
@ 2021-08-29 17:59 DJ Penton
       [not found] ` <3da0bbf7-36e8-4511-9766-7777ec427133n-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  0 siblings, 1 reply; 5+ messages in thread
From: DJ Penton @ 2021-08-29 17:59 UTC (permalink / raw)
  To: pandoc-discuss

I am new to pandoc. I am under enormous time pressure to convert a docx 
file to latex. This has worked beautifully except that alphabetical index 
entries in the docx file do not seem to be preserved. I would have expected 
a latex \index{} tag. Is there a way to do this?

I apologise for asking a question that has probably been answered 
repeatedly. I spent 15 minutes searching for an answer and didn't see one. 
Probably I just missed it. I must continue with other work on the document 
for now.

Anyway, thanks in advance; be kind :-)

-- 
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/3da0bbf7-36e8-4511-9766-7777ec427133n%40googlegroups.com.

* Re: Retaining index entries in docx file
       [not found] ` <3da0bbf7-36e8-4511-9766-7777ec427133n-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2021-08-30  2:07   ` John MacFarlane
       [not found]     ` <m2ilznbs8p.fsf-jF64zX8BO0+FqBokazbCQ6OPv3vYUT2dxr7GGTnW70NeoWH0uzbU5w@public.gmane.org>
  0 siblings, 1 reply; 5+ messages in thread
From: John MacFarlane @ 2021-08-30  2:07 UTC (permalink / raw)
  To: DJ Penton, pandoc-discuss


Sorry, indexes aren't supported.

* Re: Retaining index entries in docx file
       [not found]     ` <m2ilznbs8p.fsf-jF64zX8BO0+FqBokazbCQ6OPv3vYUT2dxr7GGTnW70NeoWH0uzbU5w@public.gmane.org>
@ 2022-02-04 13:46       ` Hendrik Seliger
       [not found]         ` <360a19b9-7a61-4452-8a57-e72cd68f05a1n-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  0 siblings, 1 reply; 5+ messages in thread
From: Hendrik Seliger @ 2022-02-04 13:46 UTC (permalink / raw)
  To: pandoc-discuss

Hi!

I just ran into the same problem and solved it somewhat manually, but it 
worked well for a document of 200+ pages.
Basic approach: unzip the docx file (these are zip archives with a 
different filename extension), tweak it so that each index entry leaves a 
text marker that survives the Pandoc conversion, re-zip the beast and then 
use Pandoc to create a LaTeX file. With the scripts below, the converted 
file has a '++index{Index entry}' for each entry. Of course, the '++index' 
then needs to be find-and-replaced with '\index'. Done.

Here are the details (please excuse the Markdown, I copied from my personal 
wiki). Hope this helps someone out there…

# Preserving indices
Pandoc does not do indices. So to keep them, unzip the Word file, open 
`document.xml` in Atom, and replace the index entries with the LaTeX 
command.

First put every XML tag on a line of its own, and replace each index entry 
by a LaTeX command. I am using `++` instead of `\` to make later 
replacement in the TeX file easier. _Of course, after the Pandoc conversion 
the `++` has to be turned back into a `\` for the index commands to work._
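
To make these two preparation steps concrete, here is a minimal sketch on a 
made-up fragment. It assumes GNU sed (for `\n` in the replacement text). The 
names `sample` and `mark_index_entries` are hypothetical, and the fragment is 
heavily simplified; a real `document.xml` carries more attributes and nesting.

```shell
#!/bin/sh
# Hypothetical, simplified fragment; real Word XML is more verbose.
sample='<w:t>Cats</w:t><w:instrText>XE &quot;cat&quot;</w:instrText><w:t> purr.</w:t>'

# 1. Break the line after every '>' and before every '<', so that each tag
#    and each run of text sits on its own line (GNU sed: \n in replacement).
# 2. Turn each line that starts with an XE field into a ++index marker,
#    followed by the XE line with its text replaced by FOO.
mark_index_entries() {
    sed -E 's/>/>\n/g' |
    sed -E 's/(.)</\1\n</g' |
    sed -E 's/^XE &quot;(.*)&quot;/++index{\1}\nXE \&quot;FOO\&quot;/'
}

marked=$(printf '%s\n' "$sample" | mark_index_entries)
printf '%s\n' "$marked"
```

In the result, `++index{cat}` stands on a line of its own, directly above 
`XE &quot;FOO&quot;`; both are still surrounded by the `instrText` tags at 
this point.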

Now the difficult part: any lines before and after a line starting with 
`++index` need to be removed, from and until a line starting with something 
other than `<`. And there can be several index commands running into each 
other without any normal text in between.

So we pull out every index entry and make a `++index` out of it using `sed`, 
which is dirty but quick. Then Perl is let loose on the file to move each 
`++index` in front of the XML tags that precede it, which puts it right 
behind the normal text, where it should end up. The rest is written out as 
is, to make sure all XML tags Word left open are properly closed. One small 
alteration: the text in the Word index command is simply replaced by _FOO_, 
so that it is easier to track down if anything went wrong or these index 
entries somehow pop up again.

This can be achieved with the following Perl script, saved as 
WordIndex2LaTeX.pl (and made executable, so chmod +x WordIndex2LaTeX.pl):
```
#!/usr/bin/perl

# @Author: Hendrik G. Seliger
# @Date:   4 February 2022, 11:34 +01:00
# @Filename: WordIndex2LaTeX.pl
# @Last modified time: 4 February 2022, 11:36 +01:00
# @License: GPL3
# @Copyright: © Copyright 2022 by Hendrik G. Seliger

use strict;
use warnings;

# XML-tag and XE lines seen since the last line of normal text
my $keptlines  = '';
# ++index markers seen since the last line of normal text
my $indexlines = '';

while ( my $line = <STDIN> ) {
        if ( $line =~ /^</ || $line =~ /^XE / ) {
                # Line with an XML tag or a Word index entry: hold it back
                $keptlines .= $line;
        } elsif ( $line =~ /^\+\+index/ ) {
                # Found an index entry. Put the LaTeX command BEFORE the
                # held-back Word tags, so that the tags are still correctly
                # opened and closed, but the LaTeX command appears first
                $indexlines .= $line;
        } else {
                # Normal text line: print the markers, then the held-back
                # tags, then the current line, and reset both buffers
                print $indexlines;
                print $keptlines;
                print $line;
                $keptlines  = '';
                $indexlines = '';
        }
}
# Flush whatever is still held back at the end of the input
print $indexlines;
print $keptlines;
```

The conversion is then run with (the `\n` in the replacement text needs 
GNU sed):

```
mkdir D
cd D
unzip ../MyDoc.docx
cd word
sed -E 's/>/>\n/g' document.xml \
  | sed -E 's/(.)</\1\n</g' \
  | sed -E 's/^XE &quot;(.*)&quot;/++index{\1}\nXE \&quot;FOO\&quot;/g' \
  | ../../WordIndex2LaTeX.pl \
  | tr -d '\n' > document2.xml
```
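
As a sanity check after this step, one can compare the number of XE fields 
in the original file with the number of markers in the rewritten one. A 
minimal sketch; `count_occurrences` is a hypothetical helper name, and the 
two file names are the ones used in the commands above.

```shell
#!/bin/sh
# Count how often the fixed string $1 occurs in file $2.
# grep -o prints one line per match, so several matches on the same line
# (document2.xml is a single long line) are all counted.
count_occurrences() {
    grep -o -F -e "$1" "$2" | wc -l | tr -d ' '
}

# Only run the comparison where both files are present.
if [ -f document.xml ] && [ -f document2.xml ]; then
    before=$(count_occurrences 'XE &quot;' document.xml)
    after=$(count_occurrences '++index{' document2.xml)
    echo "XE fields: $before, ++index markers: $after"
fi
```

If the two numbers differ, some XE fields were probably not matched by the 
`sed` pattern, for example because the field text does not start exactly 
with `XE &quot;` at the beginning of a line.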

Back up `document.xml`, rename `document2.xml` to `document.xml`, and 
re-zip the document:
```
mv document.xml ../..
mv document2.xml document.xml
cd ..
zip -r ../D.docx *
cd ..
```

* Re: Retaining index entries in docx file
       [not found]         ` <360a19b9-7a61-4452-8a57-e72cd68f05a1n-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2022-02-04 13:58           ` Hendrik Seliger
  2022-02-04 14:38           ` BPJ
  1 sibling, 0 replies; 5+ messages in thread
From: Hendrik Seliger @ 2022-02-04 13:58 UTC (permalink / raw)
  To: pandoc-discuss


BTW, I just saw that Pandoc of course also escapes the {} after index, so 
those need to be fixed as well. Easiest by pushing the output through sed, 
too:

pandoc D.docx -t latex | sed -E 's/\+\+index\\\{([^}]*)\\\}/\\index{\1}/g' > D.tex
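
The substitution can be tried on a single made-up line before running it on 
the whole document. A minimal sketch, assuming Pandoc writes each marker as 
`++index\{...\}` as described above; `fix_index_markers` is a hypothetical 
name. The `[^}]*` stops at the first closing brace, so two markers on the 
same line are converted separately.

```shell
#!/bin/sh
# Turn each ++index\{...\} marker (braces escaped by Pandoc) into \index{...}.
fix_index_markers() {
    sed -E 's/\+\+index\\\{([^}]*)\\\}/\\index{\1}/g'
}

# Made-up line with two markers, as it might appear in the LaTeX output.
line='Cats++index\{cat\} and dogs++index\{dog\} play.'
fixed=$(printf '%s\n' "$line" | fix_index_markers)
printf '%s\n' "$fixed"
```

Since sed works line by line, a marker that Pandoc wraps across two lines 
would be missed; running Pandoc with `--wrap=none` should avoid that, e.g. 
`pandoc D.docx -t latex --wrap=none | fix_index_markers > D.tex`.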

* Re: Retaining index entries in docx file
       [not found]         ` <360a19b9-7a61-4452-8a57-e72cd68f05a1n-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  2022-02-04 13:58           ` Hendrik Seliger
@ 2022-02-04 14:38           ` BPJ
  1 sibling, 0 replies; 5+ messages in thread
From: BPJ @ 2022-02-04 14:38 UTC (permalink / raw)
  To: pandoc-discuss

> please excuse the Markdown

No excuses needed IMVMNHO, least of all *here*.

Markdown was based on customary email/Usenet markup. It is HTML email which
is in want of an excuse!
