* Re: Retaining index entries in docx file
From: Hendrik Seliger @ 2022-02-04 13:58 UTC (permalink / raw)
To: pandoc-discuss
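BTW, I just saw that Pandoc of course also escapes the {} after index, so
those need to be fixed as well. Easiest by pushing the output through sed, too
(with `[^}]*` rather than `.*?`, because sed has no non-greedy matching and a
line may hold several entries):
pandoc D.docx -t latex | sed -E 's/\+\+index\\\{([^}]*)\\\}/\\index{\1}/g' > D.tex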
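
To see what this replacement does, here is a small sketch on a made-up line of Pandoc output that holds two entries. The function name `restore_index` and the sample text are assumptions for illustration only:

```shell
# Hypothetical helper: turn Pandoc's escaped ++index\{...\} markers back
# into real \index{...} commands. [^}]* stops at the first closing brace,
# so two markers on one line are handled separately.
restore_index() {
  sed -E 's/\+\+index\\\{([^}]*)\\\}/\\index{\1}/g'
}

# Made-up sample line with two index markers.
printf '%s\n' 'Text++index\{Pandoc\} and more++index\{LaTeX\}.' | restore_index
```

With a greedy `.*` in place of `[^}]*`, the match would run from the first marker to the last closing brace on the line and merge the two entries.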
Hendrik Seliger wrote on Friday, 4 February 2022 at 14:46:40 UTC+1:
> Hi!
>
> I just ran into the same problem and solved it somewhat manually, but it
> worked well for a 200+ page document.
> Basic approach: unzip the docx file (these are zip archives with a
> different filename extension), then tweak it a bit to insert a text
> marker for each index entry that survives the conversion by Pandoc, re-zip
> the beast and then use Pandoc to create a LaTeX file. With the scripts
> below, the converted file will have a '++index{Index entry}' for each
> entry. Of course, the '++index' then needs to be find-and-replaced with
> '\index'. Done.
>
> Here are the details (please excuse the Markdown, I copied from my personal
> wiki). Hope this helps someone out there…
>
>
> # Preserving indices
> Pandoc does not do indices. So to keep them, unzip the Word file, open
> `document.xml` in Atom, and replace the index entries with the LaTeX
> command.
>
> First put every xml tag on a line of its own, and replace each index entry
> with a LaTeX command. I am using `++` instead of the `\` to make later
> replacement in the TeX file easier. _Of course, after conversion in
> Pandoc the `++` needs to be turned back into a `\` manually for the
> index commands to work._
>
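
A minimal sketch of this splitting step on a made-up one-line XML fragment. The sample input and the function name `split_and_mark` are assumptions for illustration (a real `document.xml` may, for instance, have whitespace around `XE`), and the `\n` in the replacement relies on GNU sed:

```shell
# Hypothetical helper: put every xml tag on its own line, then turn each
# XE field line into a ++index line followed by a neutralised XE "FOO".
split_and_mark() {
  sed -E 's/>/>\n/g' |
    sed -E 's/(.)</\1\n</g' |
    sed -E 's/^XE "(.*)"/++index{\1}\nXE "FOO"/'
}

# Made-up sample; prints one tag or text chunk per line.
printf '%s\n' '<w:r><w:instrText>XE "Pandoc"</w:instrText></w:r>' | split_and_mark
```

The `++index{Pandoc}` line ends up directly in front of the `XE "FOO"` line, still enclosed by the field's tags; moving it out of there is the job of the reordering step.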
> Now the difficult part: each line starting with `++index` has to be pulled
> out from between the xml-tag lines surrounding it, that is, everything
> from the previous line starting with something other than `<` up to the
> next such line. And there could be several index commands running into
> each other without any normal text in between.
>
> So we pull out any index entry and make a `++index` out of it using `sed`,
> which is dirty, but quick. Then Perl is let loose on the file to pull each
> `++index` in front of the xml tags, which puts it right behind the normal
> text, where it should end up. Then we write the rest out as is, to make
> sure all xml tags Word kept open are properly closed. One small alteration:
> any text in the Word index command is simply replaced by _FOO_, so it is
> easier to track if anything went wrong or these indices somehow pop up
> again.
>
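
The reordering can be illustrated with a few lines of awk. This is only a sketch of the same buffering idea, not the Perl script from the quoted message; the function name `reorder_index` and the sample lines are made up:

```shell
# Hypothetical helper: hold back tag lines and ++index lines until the next
# normal text line, then emit the ++index lines first.
reorder_index() {
  awk '
    index($0, "++index") == 1 { idx = idx $0 "\n"; next }   # index command
    /^</ || /^XE /            { kept = kept $0 "\n"; next } # tag or field code
    { printf "%s%s", idx, kept; print; idx = ""; kept = "" } # normal text
    END { printf "%s%s", idx, kept }                         # flush the rest
  '
}

# Made-up sample: one tag or text chunk per line, joined again at the end.
printf '%s\n' '<w:t>' 'Some text' '</w:t>' '<w:instrText>' \
  '++index{Pandoc}' 'XE "FOO"' '</w:instrText>' '<w:t>' 'more' |
  reorder_index | tr -d '\n'
```

In the joined output the marker sits right behind `Some text` and in front of its closing tag, while the field code keeps only the placeholder `XE "FOO"`.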
> This can be achieved with the following perl-script saved in
> WordIndex2LaTeX.pl (and made executable, so chmod +x WordIndex2LaTeX.pl):
> ```
> #!/usr/bin/perl
>
> # @Author: Hendrik G. Seliger
> # @Date: 4 February 2022, 11:34 +01:00
> # @Filename: WordIndex2LaTeX.pl
> # @Last modified time: 4 February 2022, 11:36 +01:00
> # @License: GPL3
> # @Copyright: © Copyright 2022 by Hendrik G. Seliger
>
> use strict;
> use warnings;
>
> my $keptlines  = '';   # buffered xml tags and Word field codes
> my $indexlines = '';   # buffered ++index commands
>
> while ( my $line = <STDIN> ) {
>     if ( $line =~ /^</ || $line =~ /^XE / ) {
>         # Line with an xml tag or a Word index entry: save it.
>         $keptlines .= $line;
>     } elsif ( $line =~ /^\+\+index/ ) {
>         # Found an index entry. Put the LaTeX command BEFORE the Word
>         # tags, so that the tags are correctly opened and closed, but
>         # the LaTeX command appears first.
>         $indexlines .= $line;
>     } else {
>         # Normal text line: print the index commands, the kept lines
>         # and the current line, then erase the memory.
>         print $indexlines, $keptlines, $line;
>         $keptlines  = '';
>         $indexlines = '';
>     }
> }
> # Flush whatever is still buffered at the end of the input.
> print $indexlines, $keptlines;
> ```
>
> The conversion is then achieved with
>
> ```
> mkdir D
> cd D
> unzip ../MyDoc.docx
> cd word
> sed -E 's/>/>\n/g' document.xml | sed -E 's/(.)</\1\n</g' |
>   sed -E 's/^XE "(.*)"/++index{\1}\nXE "FOO"/' |
>   ../../WordIndex2LaTeX.pl | tr -d '\n' > document2.xml
> ```
>
> Back up `document.xml`, rename `document2.xml` to `document.xml`, and
> re-zip the document:
> ```
> mv document.xml ../..
> mv document2.xml document.xml
> cd ..
> zip -r ../D.docx *
> cd ..
> ```
>
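
Since the whole approach rests on search and replace, a quick count comparison after the Pandoc run can show whether entries got lost on the way. A sketch, assuming the patched `document.xml` and the final `D.tex` are both at hand; the helper names are hypothetical:

```shell
# Hypothetical helper: count occurrences of a fixed string in a file.
count_fixed() {
  grep -oF -- "$1" "$2" | wc -l | tr -d ' '
}

# Succeeds only if every ++index{ marker in the patched xml became an
# \index{ in the tex file and no ++index marker is left over there.
check_index_counts() {
  n_xml=$(count_fixed '++index{' "$1")
  n_tex=$(count_fixed '\index{' "$2")
  n_left=$(count_fixed '++index' "$2")
  printf 'xml markers: %s, tex index commands: %s, leftover markers: %s\n' \
    "$n_xml" "$n_tex" "$n_left"
  [ "$n_xml" -eq "$n_tex" ] && [ "$n_left" -eq 0 ]
}
```

With the directory layout from the steps above, this could be called as `check_index_counts D/word/document.xml D.tex`.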
>
> John MacFarlane wrote on Monday, 30 August 2021 at 04:07:17 UTC+2:
>
>>
>> Sorry, indexes aren't supported.
>>
>> DJ Penton <jakep...-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> writes:
>>
>> > I am new to pandoc. I am under enormous time pressure to convert a docx
>> > file to latex. This has worked beautifully except that alphabetical index
>> > entries in the docx file do not seem to be preserved. I would have expected
>> > a latex \index{} tag. Is there a way to do this?
>> >
>> > I apologise for asking a question that has probably been answered
>> > repeatedly. I spent 15 minutes searching for an answer and didn't see one.
>> > Probably I just missed it. I must continue with other work on the document
>> > for now.
>> >
>> > Anyway, thanks in advance; be kind :-)
>> >
>> > --
>> > You received this message because you are subscribed to the Google
>> Groups "pandoc-discuss" group.
>> > To unsubscribe from this group and stop receiving emails from it, send
>> an email to pandoc-discus...-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
>> > To view this discussion on the web visit
>> https://groups.google.com/d/msgid/pandoc-discuss/3da0bbf7-36e8-4511-9766-7777ec427133n%40googlegroups.com.
>>
>>
>
* Re: Retaining index entries in docx file
From: BPJ @ 2022-02-04 14:38 UTC (permalink / raw)
To: pandoc-discuss
> please excuse the Markdown
No excuses needed IMVMNHO, least of all *here*.
Markdown was based on customary email/Usenet markup. It is HTML email which
is in want of an excuse!