public inbox archive for pandoc-discuss@googlegroups.com
* preserving blockquote spaces when converting from docx
@ 2017-10-30 14:39 Stefano Zacchiroli
From: Stefano Zacchiroli @ 2017-10-30 14:39 UTC (permalink / raw)
  To: pandoc-discuss



Heya,
I'm using pandoc to convert documents exported from Google Docs in docx 
format to reStructuredText (or anything else, really; the issue I'm facing 
seems independent of the output format).
A concrete example is this document, which uses indented paragraphs for 
code samples:
https://docs.google.com/document/d/1wAMVrKIA2qtRGmoVDSUBJGmYZSygUaR0uOMW1GV3YE0/edit#heading=h.9zjhwskw53j8

I'm trying to preserve the spaces used for indentation in those code 
samples, but I'm failing to convince pandoc to do so.

The spaces I'm interested in are there in the docx. Here's an excerpt from 
its xml markup:

      <w:r w:rsidDel="00000000" w:rsidR="00000000" w:rsidRPr="00000000">
        <w:rPr>
          <w:rFonts w:ascii="Consolas" w:cs="Consolas" w:eastAsia="Consolas" w:hAnsi="Consolas"/>
          <w:sz w:val="20"/>
          <w:szCs w:val="20"/>
          <w:rtl w:val="0"/>
        </w:rPr>
        <w:t xml:space="preserve">2015-01-01 * "Taxi home from concert in Brooklyn"</w:t>
        <w:br w:type="textWrapping"/>
        <w:t xml:space="preserve">  Assets:Cash      -20 USD  ; inline comment</w:t>
        <w:br w:type="textWrapping"/>
        <w:t xml:space="preserve">  Expenses:Taxi</w:t>
      </w:r>

Pandoc recognizes those paragraphs as blockquotes (not sure why, maybe 
on the basis of their indentation?), e.g.:

  ,BlockQuote
 [Para [Strong [Str ";",Space,Str "I",Space,Str "paid",Space,Str 
"and",Space,Str "left",Space,Str "the",Space,Str "taxi,",Space,Str 
"forgot",Space,Str "to",Space,Str "take",Space,Str "change,",Space,Str 
"it",Space,Str "was",Space,Str "cold.",LineBreak],Str 
"2015-01-01",Space,Str "*",Space,Str "\"Taxi",Space,Str "home",Space,Str 
"from",Space,Str "concert",Space,Str "in",Space,Str 
"Brooklyn\"",LineBreak,Str "Assets:Cash",Space,Str "-20",Space,Str 
"USD",Space,Str ";",Space,Str "inline",Space,Str "comment",LineBreak,Str 
"Expenses:Taxi"]]

but note how it has collapsed every run of spaces into a single Space 
token, and dropped the leading indentation after each LineBreak.

Is there a way to tell pandoc to preserve those spaces, which are valuable 
to me, in anything it decides is a BlockQuote?
(Note that this happens before external filters are called, so AFAICT I 
can't work around it using --filter.)

Many thanks in advance,
Cheers.

-- 
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To post to this group, send email to pandoc-discuss-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/5d590149-457b-4de3-b863-57f70366e6d9%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


* Re: preserving blockquote spaces when converting from docx
@ 2017-10-30 17:09   ` John MACFARLANE
From: John MACFARLANE @ 2017-10-30 17:09 UTC (permalink / raw)
  To: pandoc-discuss

You might try using nonbreaking spaces in the docx;
pandoc should preserve those.



* Re: preserving blockquote spaces when converting from docx
@ 2017-10-31  7:52     ` Stefano Zacchiroli
From: Stefano Zacchiroli @ 2017-10-31  7:52 UTC (permalink / raw)
  To: pandoc-discuss

On Mon, Oct 30, 2017 at 10:09:10AM -0700, John MACFARLANE wrote:
> You might try using nonbreaking spaces in the docx;
> pandoc should preserve those.

Thanks for your answer! As I understand it, that would indeed work, but
unfortunately I don't control the documents myself. Also, it would be
annoying to change those spaces into nbsp, which they conceptually
aren't, just to make pandoc preserve them.

Is there really no way to hook into pandoc *before* it decides to remove
them?

Alternatively, what is the policy that pandoc uses to decide that those
paragraphs are blockquotes? If nothing else works, what I can do is
write a docx preprocessor that adds the nbsp automatically, but I need
to be sure that I do that only where pandoc will "see" blockquotes.
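
A minimal sketch of such a docx preprocessor, in Python, might look like
the following. It is only an illustration under stated assumptions: the
names protect_space_runs and preprocess_docx are made up here, and the
regex-based matching of <w:t> elements is a shortcut rather than a full
XML parse. It rewrites word/document.xml inside the docx archive and turns
every run of two or more spaces inside a <w:t> element into non-breaking
spaces (U+00A0). It does not try to limit itself to the paragraphs pandoc
will treat as blockquotes, since that policy is not established in this
thread; single spaces are left alone so ordinary prose is not affected.

```python
import re
import zipfile

NBSP = "\u00a0"

# Matches <w:t>text</w:t> and <w:t xml:space="preserve">text</w:t>,
# but not other elements whose names start with "w:t" (e.g. <w:tab/>).
W_T = re.compile(r"(<w:t(?:\s[^>]*)?>)([^<]*)(</w:t>)")


def protect_space_runs(document_xml: str) -> str:
    """Replace runs of 2+ spaces inside <w:t> elements with NBSPs.

    Assumption: a single space is an ordinary word separator and is kept,
    so one-space indentation would NOT be protected by this sketch.
    """
    def fix(match: re.Match) -> str:
        open_tag, text, close_tag = match.groups()
        text = re.sub(r" {2,}", lambda m: NBSP * len(m.group()), text)
        return open_tag + text + close_tag

    return W_T.sub(fix, document_xml)


def preprocess_docx(src: str, dst: str) -> None:
    """Copy the docx at src to dst, rewriting only word/document.xml."""
    with zipfile.ZipFile(src) as zin, \
         zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
        for item in zin.infolist():
            data = zin.read(item.filename)
            if item.filename == "word/document.xml":
                xml = data.decode("utf-8")
                data = protect_space_runs(xml).encode("utf-8")
            zout.writestr(item, data)
```

The rewritten file can then be passed to pandoc as usual (for instance
with -t rst); the non-breaking spaces would presumably still need to be
turned back into plain spaces in the output if that matters downstream.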

TIA,
Cheers.
-- 
Stefano Zacchiroli . zack-CfJEcLwHECWjKv3TNrM5DQ@public.gmane.org . upsilon.cc/zack . . o . . . o . o
Computer Science Professor . CTO Software Heritage . . . . . o . . . o o
Former Debian Project Leader & OSI Board Director  . . . o o o . . . o .
« the first rule of tautology club is the first rule of tautology club »


