Re: Retaining index entries in docx file


From: BPJ <bpj-J3H7GcXPSITLoDKTGw+V6w@public.gmane.org>
To: pandoc-discuss <pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
Subject: Re: Retaining index entries in docx file
Date: Fri, 4 Feb 2022 15:38:27 +0100
Message-ID: <CADAJKhCfp-riTqCg+VFU=wY-9Ut+oUeest_qZEgzD4upgbR38A@mail.gmail.com>
In-Reply-To: <360a19b9-7a61-4452-8a57-e72cd68f05a1n-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>


> please excuse the Markdown

No excuses needed IMVMNHO, least of all *here*.

Markdown was based on customary email/Usenet markup. It is HTML email which
is in want of an excuse!

On Fri, 4 Feb 2022 at 14:47, Hendrik Seliger <hgseliger-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:

> Hi!
>
> I just ran into the same problem and solved it somewhat manually, but it
> worked well for a 200+ pages document.
> Basic approach: unzip the docx file (these are zip archives with a
> different filename extension), then tweak it to insert a plain-text
> marker for each index entry that survives the Pandoc conversion, re-zip
> the beast and then use Pandoc to create a LaTeX file. With the scripts
> below, the converted file will contain a '++index{Index entry}' for each
> entry. Of course, the '++index' then needs to be find-and-replaced with
> '\index'. Done.
>
> Here are the details (please excuse the Markdown, I copied it from my
> personal wiki). Hope this helps someone out there…
>
>
> # Preserving indices
> Pandoc does not do indices. So to keep them, unzip the Word file, open
> `document.xml` in Atom, and replace the index entries with the LaTeX
> command.
>
> First put every xml-tag on a line of its own. Then replace all index
> entries by a LaTeX-command. I am using `++` instead of the `\` to make
> later replacement in the TeX-file easier. _Of course, after conversion
> with Pandoc the `++` has to be turned into a `\` manually for the
> index-commands to work._
>
> Now the difficult part: any lines before and after a line starting with
> `++index` need to be removed, from and until a line starting with something
> other than `<`. Also, there could be several index commands running into
> each other without any normal text in between.
>
> So we pull out any index entry and make a `++index` out of it using `sed`,
> which is dirty, but quick. Then perl is dropped onto the file to pull the
> `++index` before any xml-tags, which puts it right behind the normal text,
> where they should go later. Then we write the rest out as is, to make sure
> all xml-tags Word kept open are properly closed. One small alteration: any
> text in the Word index command is simply replaced by _FOO_, so it would be
> easier to track if anything went wrong or these indices somehow pop up
> again.
>
> This can be achieved with the following perl-script saved in
> WordIndex2LaTeX.pl (and made executable, so chmod +x WordIndex2LaTeX.pl):
> ```
> #!/usr/bin/perl
>
> # @Author: Hendrik G. Seliger
> # @Date:   4 February 2022, 11:34 +01:00
> # @Filename: WordIndex2LaTeX.pl
> # @Last modified time: 4 February 2022, 11:36 +01:00
> # @License: GPL3
> # @Copyright: © Copyright 2022 by Hendrik G. Seliger
>
> ###
> use strict;
> use warnings;
>
> my $keptlines  = '';   # buffered xml tags and Word XE fields
> my $indexlines = '';   # buffered ++index markers
>
> while ( my $line = <STDIN> ) {
>         if ( $line =~ /^</ || $line =~ /^XE / ) {
>                 # Line with xml tag or Word index entry: save the line.
>                 $keptlines .= $line;
>         } elsif ( $line =~ /^\+\+index/ ) {
>                 # Found an index entry. Put the LaTeX command BEFORE the
>                 # Word tags, so that the tags are correctly opened and
>                 # closed, but the LaTeX command appears first.
>                 # (The plus signs must be escaped: an unescaped `++` is a
>                 # quantifier in a Perl regex and never matches "++index".)
>                 $indexlines .= $line;
>         } else {
>                 # Normal text line: print all kept lines and the current
>                 # one, then erase the memory.
>                 print $indexlines;
>                 print $keptlines;
>                 print $line;
>                 $keptlines  = '';
>                 $indexlines = '';
>         }
> }
> # Flush whatever is still buffered at end of input.
> print $indexlines;
> print $keptlines;
> ```
>
> The conversion is then achieved with
>
> ```
> mkdir D
> cd D
> unzip ../MyDoc.docx
> cd word
> # Note: "\n" in the replacement text requires GNU sed.
> sed -E 's/>/>\n/g' document.xml \
>   | sed -E 's/(.)</\1\n</g' \
>   | sed -E 's/^XE &quot;(.*)&quot;/++index{\1}\nXE \&quot;FOO\&quot;/g' \
>   | ../../WordIndex2LaTeX.pl \
>   | tr -d '\n' > document2.xml
> ```
>
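
A quick sanity check at this point can show whether every Word index field actually got a marker. The following is only a sketch and not part of the quoted recipe: the function names `count_xe_fields` and `count_index_markers` are made up for illustration, and it assumes the quotation marks of the XE fields are stored as `&quot;`, as the `sed` expression above expects.

```shell
#!/bin/sh
# Hypothetical helpers (names are illustrative, not from the original post).

# Count Word index fields of the form  XE &quot;...  in a file.
count_xe_fields() {
    grep -oF 'XE &quot;' "$1" | wc -l | tr -d ' '
}

# Count the ++index{ markers written by the sed/perl pipeline.
count_index_markers() {
    grep -oF '++index{' "$1" | wc -l | tr -d ' '
}
```

Inside `word/`, compare `count_xe_fields document.xml` with `count_index_markers document2.xml`. If the numbers differ, some fields were probably not matched by the `sed` expression, for example because the field text does not start exactly with `XE` at the beginning of a line.
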
> Back up `document.xml` and rename `document2.xml` to `document.xml`, then
> re-zip the document:
> ```
> mv document.xml ../..
> mv document2.xml document.xml
> cd ..
> zip -r ../D.docx *
> cd ..
> ```
>
>
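
The final step described at the top of the quoted message (converting with Pandoc and turning `++index` into `\index`) could look roughly like the sketch below. It is an assumption-laden illustration rather than the original author's command: the function name `fix_index_markers` is hypothetical, and because I am not sure whether Pandoc's LaTeX writer leaves the braces as `{...}` or escapes them as `\{...\}`, the expression accepts both forms.

```shell
#!/bin/sh
# Hypothetical helper: rewrite ++index{...} (or ++index\{...\}) read from
# stdin into \index{...} on stdout. Entries that themselves contain braces
# or backslashes are left untouched and need manual attention.
fix_index_markers() {
    sed -E 's/\+\+index\\?\{([^{}\\]*)\\?\}/\\index{\1}/g'
}

# Possible usage (file names follow the steps quoted above):
#   pandoc -f docx -t latex D.docx | fix_index_markers > MyDoc.tex
```

Afterwards a search for any remaining `++index` in the `.tex` file shows which entries still need to be fixed by hand.
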
> On Monday, 30 August 2021 at 04:07:17 UTC+2, John MacFarlane wrote:
>
>>
>> Sorry, indexes aren't supported.
>>
>> DJ Penton <jakep...-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> writes:
>>
>> > I am new to pandoc. I am under enormous time pressure to convert a docx
>> > file to latex. This has worked beautifully except that alphabetical index
>> > entries in the docx file do not seem to be preserved. I would have expected
>> > a latex \index{} tag. Is there a way to do this?
>> >
>> > I apologise for asking a question that has probably been answered
>> > repeatedly. I spent 15 minutes searching for an answer and didn't see one.
>> > Probably I just missed it. I must continue with other work on the document
>> > for now.
>> >
>> > Anyway, thanks in advance; be kind :-)
>> >
>> > --
>> > You received this message because you are subscribed to the Google
>> Groups "pandoc-discuss" group.
>> > To unsubscribe from this group and stop receiving emails from it, send
>> an email to pandoc-discus...-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
>> > To view this discussion on the web visit
>> https://groups.google.com/d/msgid/pandoc-discuss/3da0bbf7-36e8-4511-9766-7777ec427133n%40googlegroups.com.
>>
>>
>

-- 
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/CADAJKhCfp-riTqCg%2BVFU%3DwY-9Ut%2BoUeest_qZEgzD4upgbR38A%40mail.gmail.com.


Thread overview: 5+ messages
2021-08-29 17:59 DJ Penton
     [not found] ` <3da0bbf7-36e8-4511-9766-7777ec427133n-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
2021-08-30  2:07   ` John MacFarlane
     [not found]     ` <m2ilznbs8p.fsf-jF64zX8BO0+FqBokazbCQ6OPv3vYUT2dxr7GGTnW70NeoWH0uzbU5w@public.gmane.org>
2022-02-04 13:46       ` Hendrik Seliger
     [not found]         ` <360a19b9-7a61-4452-8a57-e72cd68f05a1n-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
2022-02-04 13:58           ` Hendrik Seliger
2022-02-04 14:38           ` BPJ [this message]
