public inbox archive for pandoc-discuss@googlegroups.com
* docx to epub, keep id values
@ 2020-02-19  6:16 Finn Mathisen
       [not found] ` <76525fee-b7d6-408f-86a8-99e811592460-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  0 siblings, 1 reply; 5+ messages in thread
From: Finn Mathisen @ 2020-02-19  6:16 UTC
  To: pandoc-discuss



I am converting Word docx files to EPUB using Pandoc.
During conversion, Pandoc replaces the original bookmark IDs with IDs derived
from the title text of Heading 1, Heading 2, and so on.
I need to keep the original IDs from the docx file.

Any solution to this?

Regards
Finn Mathisen

-- 
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/76525fee-b7d6-408f-86a8-99e811592460%40googlegroups.com.


* Re: docx to epub, keep id values
       [not found] ` <76525fee-b7d6-408f-86a8-99e811592460-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2020-02-19 15:06   ` Dmitriy Krasilnikov
       [not found]     ` <72991876.20200219180606-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  2020-02-20  8:17   ` Finn Mathisen
  1 sibling, 1 reply; 5+ messages in thread
From: Dmitriy Krasilnikov @ 2020-02-19 15:06 UTC
  To: Finn Mathisen

> I am converting Word docx files to EPUB using Pandoc.
> During conversion, Pandoc replaces the original bookmark IDs with IDs derived from the title text of Heading 1, Heading 2, and so on.
> I need to keep the original IDs from the docx file.

> Any solution to this?

As far as I can see, pandoc's docx reader mostly recognizes visible text and docx
styles. It has no support for docx constructs such as bookmarks
(<w:bookmarkStart w:id="0" w:name="this_book_is_about_something"/><w:bookmarkEnd
w:id="0"/>) or index markers (in Word: { XE "marker" }).
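
To see which bookmarks a given file actually contains, the w:bookmarkStart
elements can be read straight out of the docx archive. The following is a
minimal sketch using only the Python standard library; the function name
list_bookmarks is made up for illustration, and it looks only at
word/document.xml (not at footnotes, headers, or other parts).

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, bound to the "w:" prefix in document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def list_bookmarks(docx):
    """Return (w:id, w:name) pairs for each w:bookmarkStart in the main body.

    `docx` may be a file path or a binary file-like object.
    (Hypothetical helper, not part of pandoc or python-docx.)
    """
    with zipfile.ZipFile(docx) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    return [
        (el.get(f"{{{W_NS}}}id"), el.get(f"{{{W_NS}}}name"))
        for el in root.iter(f"{{{W_NS}}}bookmarkStart")
    ]
```

Calling list_bookmarks("book.docx") on a real file (the file name is a
placeholder) gives the names that would have to be carried over to the EPUB.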

-- 
Best regards,
D. I. Krasilnikov
8-915-204-33-42


* Re: docx to epub, keep id values
       [not found]     ` <72991876.20200219180606-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2020-02-19 17:31       ` John MacFarlane
       [not found]         ` <m2ftf6xz4h.fsf-pgq/RBwaQ+zq8tPRBa0AtqxOck334EZe@public.gmane.org>
  0 siblings, 1 reply; 5+ messages in thread
From: John MacFarlane @ 2020-02-19 17:31 UTC
  To: Dmitriy Krasilnikov, Finn Mathisen


There is some code in the docx reader that reads bookmarks.
However, it doesn't seem to use bookmarks in headers for the
header IDs.  I couldn't quite tell why, but Jesse Rosenthal
(author of the docx reader) could probably answer this question.


* Re: docx to epub, keep id values
       [not found]         ` <m2ftf6xz4h.fsf-pgq/RBwaQ+zq8tPRBa0AtqxOck334EZe@public.gmane.org>
@ 2020-02-19 20:07           ` BPJ
  0 siblings, 0 replies; 5+ messages in thread
From: BPJ @ 2020-02-19 20:07 UTC
  To: pandoc-discuss


There is a Python module for modifying docx.

https://python-docx.readthedocs.io/en/latest/

Perhaps you could use it to insert, after each heading's text, a character run
with a particular style that contains the docx bookmark ID. A Pandoc filter
could then visit the headings, set each heading's ID to the text of the
resulting span, and remove the span. Hopefully that is not as complicated as it
sounds! Note that what Pandoc calls a Header and what DOCX calls a header are
not the same thing: in DOCX, a heading is just a paragraph with a particular
predefined style.
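
The filter half of this idea might look roughly like the sketch below, which
works on pandoc's JSON AST using only the Python standard library. It rests on
several assumptions that are not from this thread: the marker run uses a
character style named "BookmarkId" (a made-up name), the document is read with
the docx reader's styles extension so that the run arrives as a Span carrying
a custom-style attribute, and only top-level headers need handling. Checking
the real AST with `pandoc -f docx+styles -t native` first would be wise.

```python
import json
import sys

MARKER_STYLE = "BookmarkId"  # hypothetical character style name


def is_marker(inline):
    """True for a Span whose custom-style attribute equals MARKER_STYLE."""
    if inline.get("t") != "Span":
        return False
    (_ident, _classes, keyvals), _content = inline["c"]
    return ["custom-style", MARKER_STYLE] in keyvals


def plain_text(inlines):
    """Concatenate the Str elements of a list of inlines."""
    return "".join(i["c"] for i in inlines if i["t"] == "Str")


def fix_header(block):
    """Use the marker span's text as the header id, then drop the span."""
    if block.get("t") != "Header":
        return block
    level, (ident, classes, keyvals), inlines = block["c"]
    markers = [i for i in inlines if is_marker(i)]
    if not markers:
        return block
    new_id = plain_text(markers[0]["c"][1]) or ident
    kept = [i for i in inlines if not is_marker(i)]
    # Remove whitespace left dangling where the marker used to be.
    while kept and kept[-1]["t"] in ("Space", "SoftBreak"):
        kept.pop()
    return {"t": "Header", "c": [level, [new_id, classes, keyvals], kept]}


def apply_filter(doc):
    """Rewrite top-level headers of a pandoc JSON document in place."""
    doc["blocks"] = [fix_header(b) for b in doc["blocks"]]
    return doc


def main():
    """Filter entry point: JSON AST in on stdin, JSON AST out on stdout."""
    json.dump(apply_filter(json.load(sys.stdin)), sys.stdout)
```

To use it as a filter, the file would need to call main() at the end; saved
as, say, heading_ids.py (a placeholder name), it could then be run with
`pandoc -f docx+styles input.docx --filter heading_ids.py -o output.epub`.
Headers nested inside other blocks (for example Divs) are not touched by this
sketch.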



* Re: docx to epub, keep id values
       [not found] ` <76525fee-b7d6-408f-86a8-99e811592460-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  2020-02-19 15:06   ` Dmitriy Krasilnikov
@ 2020-02-20  8:17   ` Finn Mathisen
  1 sibling, 0 replies; 5+ messages in thread
From: Finn Mathisen @ 2020-02-20  8:17 UTC
  To: pandoc-discuss



Thank you all for the tips and hints!
I think I have an idea of how my problem can be solved!

Regards
Finn Mathisen

