mapping locations between docx and md

public inbox archive for pandoc-discuss@googlegroups.com

* mapping locations between docx and md
@ 2016-03-14 14:34 James H
       [not found] ` <29dd3c46-c9af-46a3-939e-c9857d3477c5-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  0 siblings, 1 reply; 7+ messages in thread
From: James H @ 2016-03-14 14:34 UTC (permalink / raw)
  To: pandoc-discuss


Hello all. 

What's the best way to preserve the original location of text extracted
from a .docx file when converting to markdown?

For example, say that I have a string in my markdown. I want to find the
exact character location (starting offset) where it occurs in the
original .docx file. I'm using the Microsoft Office interop to find the
text in a Word document and am trying to find a way to align my Word .docx
documents with the MD file.

Thanks in advance! 

James 

-- 
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To post to this group, send email to pandoc-discuss-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/29dd3c46-c9af-46a3-939e-c9857d3477c5%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


* Re: mapping locations between docx and md
       [not found] ` <29dd3c46-c9af-46a3-939e-c9857d3477c5-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2016-03-14 15:18   ` Matthew Pickering
  2016-03-14 15:21   ` Jesse Rosenthal
  2016-03-15 14:51   ` Joachim Schiele
  2 siblings, 0 replies; 7+ messages in thread
From: Matthew Pickering @ 2016-03-14 15:18 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

Unless things have changed since I last looked, I suspect you would
have to significantly modify pandoc in order to do this. That being
said, I don't really understand what you want to do this for. Maybe
someone could suggest a workaround if you provided a fuller
description?




* Re: mapping locations between docx and md
       [not found] ` <29dd3c46-c9af-46a3-939e-c9857d3477c5-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  2016-03-14 15:18   ` Matthew Pickering
@ 2016-03-14 15:21   ` Jesse Rosenthal
       [not found]     ` <m1k2l5axv9.fsf-4GNroTWusrE@public.gmane.org>
  2016-03-15 14:51   ` Joachim Schiele
  2 siblings, 1 reply; 7+ messages in thread
From: Jesse Rosenthal @ 2016-03-14 15:21 UTC (permalink / raw)
  To: James H, pandoc-discuss

James H <jamesrharwood-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> writes:

> What's the best way to preserve the original location of text extracted 
> from .docx file when converting to markdown?

I can't see any good way offhand, for a couple of reasons:

 1. pandoc's internal document model doesn't have a concept of numerical
    character position.

 2. Word's internal format (a collection of XML files) doesn't
    either. Footnotes are in a different file than the rest of the
    content, for example. Unless Word has changed, I don't think it
    allows you to search by character offset. (It does let you search by
    line number, but that's related to the typesetter, which does
    the wrapping, not the actual docx file.)
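
To illustrate point 2, here is a minimal Python sketch, assuming only
that a .docx is a zip archive whose text lives in WordprocessingML parts
such as word/document.xml and word/footnotes.xml. The names
paragraph_offsets and make_demo_docx are hypothetical, the demo archive
is hand-built and is not a complete .docx, and the "+1 per paragraph"
offset convention is an assumption rather than Word's or pandoc's
definition of a character position.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Standard WordprocessingML namespace used by document.xml and footnotes.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def paragraph_offsets(docx_bytes, part="word/document.xml"):
    """Return (start_offset, text) for each paragraph in ONE part of a .docx.

    Offsets are counted within that part only; nothing in the file format
    defines a single offset that spans the body and the footnotes together.
    """
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        root = ET.fromstring(archive.read(part))
    results = []
    offset = 0
    for para in root.iter(f"{{{W_NS}}}p"):
        text = "".join(t.text or "" for t in para.iter(f"{{{W_NS}}}t"))
        results.append((offset, text))
        offset += len(text) + 1  # assumption: count one char per paragraph mark
    return results


def make_demo_docx():
    """Build a tiny hypothetical archive with a body part and a footnotes part."""
    body = (f'<w:document xmlns:w="{W_NS}"><w:body>'
            '<w:p><w:r><w:t>Hello</w:t></w:r></w:p>'
            '<w:p><w:r><w:t>World</w:t></w:r></w:p>'
            '</w:body></w:document>')
    notes = (f'<w:footnotes xmlns:w="{W_NS}">'
             '<w:footnote><w:p><w:r><w:t>A note</w:t></w:r></w:p></w:footnote>'
             '</w:footnotes>')
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr("word/document.xml", body)
        archive.writestr("word/footnotes.xml", notes)
    return buf.getvalue()
```

Calling paragraph_offsets(make_demo_docx()) walks the body, and passing
part="word/footnotes.xml" walks the footnotes. Each call starts again at
offset 0, which is the difficulty: any document-wide offset has to be a
convention you impose yourself.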

Best,
Jesse



* Re: mapping locations between docx and md
       [not found]     ` <m1k2l5axv9.fsf-4GNroTWusrE@public.gmane.org>
@ 2016-03-14 20:21       ` James Harwood
       [not found]         ` <CABNnZ-QefvT73Fk27c+y8SG3-paXCnKtWW3BoLFyinWRqFaXcg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 7+ messages in thread
From: James Harwood @ 2016-03-14 20:21 UTC (permalink / raw)
  To: Jesse Rosenthal, pandoc-discuss


Thanks Jesse.

Do you know whether line breaks stay consistent when converting docx to md?

For instance, if I split my docx file by line breaks, and did the same to
my subsequent md file would I end up with the same number of blocks? Even
this would be good enough location info for what I need.

James





* Re: mapping locations between docx and md
       [not found]         ` <CABNnZ-QefvT73Fk27c+y8SG3-paXCnKtWW3BoLFyinWRqFaXcg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-03-14 20:27           ` Jesse Rosenthal
  2016-03-15 17:14           ` Daniel Staal
  1 sibling, 0 replies; 7+ messages in thread
From: Jesse Rosenthal @ 2016-03-14 20:27 UTC (permalink / raw)
  To: James Harwood, pandoc-discuss

No -- the line breaks aren't in the docx file itself. (Literal line
breaks, the sort you get from typing shift-enter or something, are in
there. But not normal wrapping linebreaks.) Word figures them
out based on the word-wrapping it does when it displays the file. So, in
other words, it requires the proprietary Word program, not just the
file.

And line breaks in pandoc also depend on your settings (i.e. the
`--columns` option). So, again, it's a display issue.

Which is all to say, I don't think pandoc can do what you're after here.
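
A small Python sketch of that distinction, assuming the standard
WordprocessingML namespace: explicit breaks are stored in the XML as
w:br elements, while wrapped lines are not stored anywhere. The names
count_breaks and SAMPLE are hypothetical, and the sample XML is
hand-written rather than taken from a real document.xml.

```python
import xml.etree.ElementTree as ET

# Standard WordprocessingML namespace.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def count_breaks(document_xml):
    """Return (paragraph_count, explicit_break_count) for a document.xml string.

    Note: w:br also encodes page and column breaks (via its w:type
    attribute), so this counts every explicit break, not only line breaks.
    """
    root = ET.fromstring(document_xml)
    paragraphs = sum(1 for _ in root.iter(f"{{{W_NS}}}p"))
    explicit_breaks = sum(1 for _ in root.iter(f"{{{W_NS}}}br"))
    return paragraphs, explicit_breaks


# Hand-written fragment: one paragraph with an explicit break, and one long
# paragraph that a word processor would wrap on screen without storing it.
SAMPLE = (
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    '<w:p><w:r><w:t>first line</w:t><w:br/><w:t>second line</w:t></w:r></w:p>'
    '<w:p><w:r><w:t>a long paragraph that would be wrapped when displayed'
    '</w:t></w:r></w:p>'
    '</w:body></w:document>'
)
```

The second paragraph contributes no break at all, however many lines it
occupies on screen, so counting displayed lines cannot be done from the
file alone.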




* Re: mapping locations between docx and md
       [not found] ` <29dd3c46-c9af-46a3-939e-c9857d3477c5-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  2016-03-14 15:18   ` Matthew Pickering
  2016-03-14 15:21   ` Jesse Rosenthal
@ 2016-03-15 14:51   ` Joachim Schiele
  2 siblings, 0 replies; 7+ messages in thread
From: Joachim Schiele @ 2016-03-15 14:51 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

On 14.03.2016 15:34, James H wrote:
> Hello all. 
> 
> What's the best way to preserve the original location of text extracted
> from .docx file when converting to markdown?
> 
> For example, say that I have string in my markdown,  I want to find its
> exact character location (starting offset) of where it occurs in the
> original .docx file.   I'm using the Microsoft office interop to find
> the text in a word document and am trying to find a way to align my word
> .docx documents with the MD file.  

i'm facing the same problem with my setup:

the editor looks like this:
| mdwn | html |

and now i want to scroll the right pane (preview) to the associated
position in the mdwn and vice versa. this is wonderfully done with
editors like stackedit.io for instance.

one way would be to search for text patterns or include invisible
markers like <div id="aaab"></div> which are also in the output and can
be focused on.
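
a minimal sketch of the marker idea in Python, assuming that blocks are
separated by blank lines and that raw HTML in the markdown is passed
through to the HTML output. add_markers and the "blk" id prefix are
hypothetical names, and the naive blank-line split would also cut
through fenced code blocks that contain blank lines.

```python
import re


def add_markers(markdown, prefix="blk"):
    """Insert an empty <div id="..."></div> before every block of the input.

    Blocks are taken to be runs of text separated by blank lines; this is
    a simplification, not a real markdown parser.
    """
    blocks = [b for b in re.split(r"\n\s*\n", markdown.strip()) if b.strip()]
    marked = []
    for index, block in enumerate(blocks):
        marked.append(f'<div id="{prefix}-{index}"></div>')
        marked.append(block)
    return "\n\n".join(marked) + "\n"
```

the same ids then exist on both sides, so a preview pane could scroll to
the element whose id matches the block under the cursor.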

in my case i want to make incremental updates in the mdwn and highlight
the changed parts on the html. i currently think that i have to compare
the two DOM-trees on each change. haven't played with that yet but i
didn't find many good libraries for that.
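
as a cheaper stand-in for comparing DOM trees, the changed parts can be
located on the source side with a block-level diff. this is a sketch
using Python's standard difflib, not a DOM comparison; split_blocks and
changed_blocks are hypothetical names, and blocks are again assumed to
be separated by blank lines.

```python
import difflib
import re


def split_blocks(markdown):
    """Split text into blank-line-separated blocks (a simplification)."""
    return [b.strip() for b in re.split(r"\n\s*\n", markdown.strip()) if b.strip()]


def changed_blocks(old, new):
    """Return indices (into the NEW text's blocks) that were edited or added."""
    old_blocks, new_blocks = split_blocks(old), split_blocks(new)
    matcher = difflib.SequenceMatcher(a=old_blocks, b=new_blocks, autojunk=False)
    changed = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "insert"):
            changed.extend(range(j1, j2))
    return changed
```

only the blocks at the returned indices would need to be re-rendered and
highlighted in the preview; deleted blocks do not appear in the result
because they no longer exist in the new text.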


* Re: mapping locations between docx and md
       [not found]         ` <CABNnZ-QefvT73Fk27c+y8SG3-paXCnKtWW3BoLFyinWRqFaXcg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2016-03-14 20:27           ` Jesse Rosenthal
@ 2016-03-15 17:14           ` Daniel Staal
  1 sibling, 0 replies; 7+ messages in thread
From: Daniel Staal @ 2016-03-15 17:14 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

--As of March 14, 2016 8:21:29 PM +0000, James Harwood is alleged to have 
said:

> Do you know whether line breaks stay consistent when converting docx to
> md?
>
> For instance, if I split my docx file by line breaks, and did the same to
> my subsequent md file would I end up with the same number of blocks? Even
> this would be good enough location info for what I need.

--As for the rest, it is mine.

If by 'blocks' you mean 'paragraphs', yes.  If by 'blocks' you mean 
'lines', no.  ;)  It depends on how you define block and line break - 
Markdown treats contiguous lines as a single paragraph, and lines separated 
by at least one empty line as separate paragraphs.  Pandoc will preserve 
paragraphs when converting, but the line length within that will depend on 
your settings (for Markdown) and the output format.
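
A rough Python sketch of that point, using the standard textwrap module
as a stand-in for a markdown writer's wrapping (it is not pandoc's
algorithm). The names split_paragraphs and rewrap are hypothetical, and
paragraphs are assumed to be separated by blank lines.

```python
import re
import textwrap


def split_paragraphs(text):
    """Return paragraphs with internal line breaks collapsed to spaces."""
    return [" ".join(p.split())
            for p in re.split(r"\n\s*\n", text.strip()) if p.strip()]


def rewrap(text, columns):
    """Re-wrap each paragraph to the given width, keeping paragraph breaks."""
    wrapped = (textwrap.fill(p, width=columns) for p in split_paragraphs(text))
    return "\n\n".join(wrapped) + "\n"


source = "alpha beta gamma delta epsilon\n\nzeta eta theta\n"
narrow = rewrap(source, 12)
wide = rewrap(source, 72)
```

Here narrow and wide have different numbers of lines, but
split_paragraphs gives the same list for both, so a paragraph index is a
stable coordinate where a line number is not.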

Daniel T. Staal

---------------------------------------------------------------
This email copyright the author.  Unless otherwise noted, you
are expressly allowed to retransmit, quote, or otherwise use
the contents for non-commercial purposes.  This copyright will
expire 5 years after the author's death, or in 30 years,
whichever is longer, unless such a period is in excess of
local copyright law.
---------------------------------------------------------------

