* mapping locations between docx and md
From: James H @ 2016-03-14 14:34 UTC
To: pandoc-discuss

Hello all.

What's the best way to preserve the original location of text extracted from a .docx file when converting to Markdown?

For example, say that I have a string in my Markdown; I want to find the exact character location (starting offset) where it occurs in the original .docx file. I'm using the Microsoft Office interop to find the text in a Word document and am trying to find a way to align my Word .docx documents with the MD file.

Thanks in advance!

James

* Re: mapping locations between docx and md
From: Matthew Pickering @ 2016-03-14 15:18 UTC
To: pandoc-discuss

Unless things have changed since I last looked, I suspect you would have to modify pandoc significantly in order to do this.

That being said, I don't really understand what you want to do this for. Maybe someone could suggest a workaround if you provided a fuller description?

* Re: mapping locations between docx and md
From: Jesse Rosenthal @ 2016-03-14 15:21 UTC
To: James H, pandoc-discuss

James H writes:

> What's the best way to preserve the original location of text extracted
> from .docx file when converting to markdown?

I can't see any good way offhand, for a couple of reasons:

1. pandoc's internal document model doesn't have a concept of numerical
   character position.

2. Word's internal format (a collection of XML files) doesn't
   either. Footnotes are in a different file than the rest of the
   content, for example. Unless Word has changed, I don't think it
   allows you to search by character offset. (It does let you search by
   line number, but that's related to the typesetter, which does
   the wrapping, not the actual docx file.)

Best,
Jesse

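To make the "collection of XML files" point concrete, here is a minimal Python sketch that opens a .docx as a zip archive, lists its parts, and reads paragraph text from the main body part. The helper names `docx_parts` and `body_paragraphs` are made up for illustration, and the part name `word/document.xml` is the conventional location of the body, assumed here rather than resolved through the package relationships. This is not how pandoc's docx reader works; it only shows why a single character offset into "the document" is not well defined.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_parts(docx):
    """Return the names of the files stored inside a .docx (a zip archive)."""
    with zipfile.ZipFile(docx) as zf:
        return zf.namelist()


def body_paragraphs(docx, part="word/document.xml"):
    """Return the text of each <w:p> paragraph in one part, in document order."""
    with zipfile.ZipFile(docx) as zf:
        root = ET.fromstring(zf.read(part))
    paragraphs = []
    for p in root.iter(W + "p"):
        # A paragraph's text is split across runs; join every <w:t> inside it.
        paragraphs.append("".join(t.text or "" for t in p.iter(W + "t")))
    return paragraphs
```

Calling `docx_parts("report.docx")` on a typical file would list entries such as `word/document.xml` and, when footnotes exist, `word/footnotes.xml`; the file name here is a placeholder.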
* Re: mapping locations between docx and md
From: James Harwood @ 2016-03-14 20:21 UTC
To: Jesse Rosenthal, pandoc-discuss

Thanks Jesse.

Do you know whether line breaks stay consistent when converting docx to md?

For instance, if I split my docx file by line breaks, and did the same to my subsequent md file, would I end up with the same number of blocks? Even this would be good enough location info for what I need.

James

* Re: mapping locations between docx and md
From: Jesse Rosenthal @ 2016-03-14 20:27 UTC
To: James Harwood, pandoc-discuss

No -- the line breaks aren't in the docx file itself. (Literal line breaks, the sort you get from typing ctrl-enter or something, are in there. But not normal wrapping linebreaks.) Word figures them out based on the word-wrapping it does when it displays the file. So, in other words, it requires the proprietary Word program, not just the file.

And line breaks in pandoc also depend on your settings (i.e. the `--columns` option). So, again, it's a display issue.

Which is all to say, I don't think pandoc can do what you're after here.

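A small Python sketch of why wrapped lines make a poor location key: the same paragraph breaks into a different number of lines depending on the column width. The standard-library `textwrap` module stands in for a wrapping step here; it is an assumption for illustration and not pandoc's or Word's actual wrapping algorithm, and `wrapped_line_count` is a hypothetical helper name.

```python
import textwrap


def wrapped_line_count(paragraph, columns):
    """Number of lines the paragraph occupies when wrapped at `columns`."""
    return len(textwrap.wrap(paragraph, width=columns))


# Forty four-letter words separated by single spaces.
paragraph = " ".join(["word"] * 40)

narrow = textwrap.wrap(paragraph, width=40)
wide = textwrap.wrap(paragraph, width=72)

# The line count changes with the width, but joining the lines back
# together recovers the same paragraph text either way.
same_text = " ".join(narrow) == " ".join(wide) == paragraph
```

Comparing text only after undoing the wrapping, paragraph by paragraph, sidesteps the display-dependent line breaks described above.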
* Re: mapping locations between docx and md
From: Daniel Staal @ 2016-03-15 17:14 UTC
To: pandoc-discuss

--As of March 14, 2016 8:21:29 PM +0000, James Harwood is alleged to have said:

> Do you know whether line breaks stay consistent when converting docx to
> md?
>
> For instance, if I split my docx file by line breaks, and did the same to
> my subsequent md file would I end up with the same number of blocks? Even
> this would be good enough location info for what I need.

--As for the rest, it is mine.

If by 'blocks' you mean 'paragraphs', yes. If by 'blocks' you mean 'lines', no. ;)

It depends on how you define block and line break - Markdown treats contiguous lines as a single paragraph, and lines separated by at least one empty line as separate paragraphs. Pandoc will preserve paragraphs when converting, but the line length within that will depend on your settings (for Markdown) and the output format.

Daniel T. Staal

---------------------------------------------------------------
This email copyright the author. Unless otherwise noted, you
are expressly allowed to retransmit, quote, or otherwise use
the contents for non-commercial purposes. This copyright will
expire 5 years after the author's death, or in 30 years,
whichever is longer, unless such a period is in excess of
local copyright law.
---------------------------------------------------------------

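The paragraph rule described above (contiguous lines form one paragraph, blank lines separate paragraphs) can be sketched in Python as follows. The helper names `markdown_paragraphs` and `paragraph_index` are hypothetical, and the split is deliberately naive: it ignores lists, headings, fenced code and other block types, so treat it as an illustration of paragraph-level location info rather than a Markdown parser.

```python
import re


def markdown_paragraphs(md_text):
    """Split Markdown text on blank lines and undo the wrapping in each block."""
    blocks = re.split(r"\n\s*\n", md_text.strip())
    # Collapse internal newlines and repeated spaces into single spaces.
    return [" ".join(block.split()) for block in blocks if block.strip()]


def paragraph_index(md_text, needle):
    """Return the 0-based index of the first paragraph containing `needle`, or -1."""
    for index, paragraph in enumerate(markdown_paragraphs(md_text)):
        if needle in paragraph:
            return index
    return -1
```

If the paragraph count on the docx side matches, the index returned by `paragraph_index` can serve as a coarse location that does not depend on line wrapping.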
* Re: mapping locations between docx and md
From: Joachim Schiele @ 2016-03-15 14:51 UTC
To: pandoc-discuss

On 14.03.2016 15:34, James H wrote:
> Hello all.
>
> What's the best way to preserve the original location of text extracted
> from .docx file when converting to markdown?
>
> For example, say that I have string in my markdown, I want to find its
> exact character location (starting offset) of where it occurs in the
> original .docx file. I'm using the Microsoft office interop to find
> the text in a word document and am trying to find a way to align my word
> .docx documents with the MD file.

I'm facing the same problem with my setup. The editor looks like this:

| mdwn | html |

Now I want to scroll the right pane (preview) to the position associated with the mdwn, and vice versa. This is wonderfully done by editors like stackedit.io, for instance.

One way would be to search for text patterns, or to include invisible markers like <div id="aaab"></div>, which also appear in the output and can be focused on.

In my case I want to make incremental updates in the mdwn and highlight the changed parts in the html. I currently think that I have to compare the two DOM trees on each change. I haven't played with that yet, but I didn't find many good libraries for it.

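The invisible-marker idea can be sketched in Python like this: put an empty `<div>` with a generated id in front of every block of the Markdown source and remember which source line each id belongs to. The function name `add_anchors` and the `blk-N` id scheme are invented for this example, and it assumes the converter passes raw HTML through to the output. Block detection is naive (a non-blank line after a blank line), so fenced code blocks containing blank lines would need extra care.

```python
def add_anchors(md_text, prefix="blk"):
    """Insert an empty <div id=...> before each block; return (text, id -> line)."""
    out = []
    anchors = {}
    previous_blank = True  # Treat the start of the file like a blank line.
    for lineno, line in enumerate(md_text.splitlines(), start=1):
        is_blank = line.strip() == ""
        if previous_blank and not is_blank:
            anchor_id = f"{prefix}-{len(anchors)}"
            anchors[anchor_id] = lineno  # Line number in the ORIGINAL source.
            out.append(f'<div id="{anchor_id}"></div>')
            out.append("")
        out.append(line)
        previous_blank = is_blank
    return "\n".join(out) + "\n", anchors
```

A preview pane could then scroll to the element whose id maps to the source line nearest the cursor, and the reverse lookup gives the source line for a clicked element.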