public inbox archive for pandoc-discuss@googlegroups.com
* DOCX to markdown. Poetry. Keep whitespace and verse structure.
@ 2018-12-28 12:51 Lars Bingchong
From: Lars Bingchong @ 2018-12-28 12:51 UTC
  To: pandoc-discuss


Hello ladies and gentlemen. This is my first post in the "pandoc-discuss" 
group. Allow me to explain myself.

====TRYING TO====

* Convert a large number of DOCX documents whose text is structured like this:

----

Evigt liv til salg

Hvis nyeste forskning besad.
Gralen til evigt liv via morgenmad.
Ville du gå til bords?
Mæske dig i libidoens buffet.

Eller tror du, fordi du tror?
At evigt liv er en Guds givet gave.
Til frit valg på hylde 1.
Et omfavnende selv tak - det var så lidt.

At det eksisterer kan vi ikke bevise.
At det gør kan vi ikke benægte.
Fakta, aktualitetens modstander.
Og aktuelt er det evige liv for os.

På den ene eller den anden led.
Ønsker vi livet bliver ved.
For det er i det levne liv.
At livet giver 4.

Så hvad gør en klog.
Forsøger at leve evigt sæføli.
Om ikke i kød og dundrende mørkt blod.
Så ihukommelse af os selv i andre.

Skakmat
160603
(Genfødsel - evigt liv) - 2660 - kulturweekend

----

So what I would like pandoc to do when run on a DOCX document of 
the above type is:


   1. Keep the blank lines between the verses, and after the first line, 
   which is the title.
   2. Keep the verse structure, so that lines not separated by a blank 
   line stay together.
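
To make the two points above concrete, here is a minimal sketch in plain Lua of the grouping I have in mind. It does not touch pandoc at all: `group_stanzas` is a name made up for illustration, and treating a whitespace-only string as the verse separator is only an assumption about how the blank lines would show up in the input.

```lua
-- Hypothetical helper (not part of pandoc): split a flat list of lines
-- into stanzas, starting a new stanza at every blank line.
local function group_stanzas(lines)
  local stanzas = {}
  local current = {}
  for _, line in ipairs(lines) do
    if line:match("^%s*$") then
      -- A blank line closes the stanza collected so far, if any.
      if #current > 0 then
        table.insert(stanzas, current)
        current = {}
      end
    else
      table.insert(current, line)
    end
  end
  -- Do not lose the last stanza when the input has no trailing blank line.
  if #current > 0 then
    table.insert(stanzas, current)
  end
  return stanzas
end

-- The title and the first lines of the poem above, used as sample input.
local stanzas = group_stanzas({
  "Evigt liv til salg",
  "",
  "Hvis nyeste forskning besad.",
  "Gralen til evigt liv via morgenmad.",
  "",
  "Eller tror du, fordi du tror?",
})

-- Print each stanza as a block, followed by one empty line.
for _, stanza in ipairs(stanzas) do
  print(table.concat(stanza, "\n"))
  print()
end
```

The title ends up as a stanza of its own, and lines with no blank line between them stay in the same group, which is the structure I would like to see preserved in the Markdown output.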

====TRIED====


   1. *sudo pandoc -s file.docx -t markdown -o mydoc.md --wrap=none 
   --extract-media .* --> that did not do the job.
   2. Searched through this discussion group to see if this had already 
   been solved.
   3. Had a good look at the pandoc documentation. Disclaimer: I have no 
   prior experience with Lua and have not used pandoc to a great extent.
   4. Then I tried a Lua filter, inspired by this discussion >> 
   https://groups.google.com/forum/#!searchin/pandoc-discuss/paragraphs%7Csort:date/pandoc-discuss/wlP6AL11NIY/PxF4d6ilBQAJ
      1. I modified it a bit and ended up with:
   
```
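function Pandoc(doc)
  -- Start from an empty line block; each entry is one line (a list of inlines).
  local lb = pandoc.LineBlock({})
  -- The document's top-level blocks are in doc.blocks.
  for _, b in ipairs(doc.blocks) do
    if b.t == "Para" and b.content ~= nil then
      -- Every paragraph becomes one line of the single line block.
      table.insert(lb.content, b.content)
    end
  end
  return pandoc.Pandoc({lb}, doc.meta)
end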
```
--> That moves the conversion in the right direction. The lines no longer look like this:

```
Evigt liv til salg

Hvis nyeste forskning besad.

Gralen til evigt liv via morgenmad.

Ville du gå til bords?

Mæske dig i libidoens buffet.

Eller tror du, fordi du tror?
```

but like this:

```
| Evigt liv til salg
| Hvis nyeste forskning besad.
| Gralen til evigt liv via morgenmad.
| Ville du gå til bords?
| Mæske dig i libidoens buffet.
| Eller tror du, fordi du tror?
| At evigt liv er en Guds givet gave.
```

However, as stated in the "what I would like" section above, it still does not:


   1. Keep the blank lines between the verses, and after the first line, 
   which is the title.
   2. Keep the verse structure, so that lines not separated by a blank 
   line stay together.

----

So I'm seeking help on how to accomplish what I want with a Lua filter, as 
this seems like the right path.

Thank you very much :-) and a happy new year (it's soon :-).

sudo pandoc -s /Volumes/IBIGDATA/IBIG\ Data/Documents/POEMS\ -\ PHILOSOPHIES\ -\ WORDPLAY/FINISHED\ POEMS/DANISH/2016/Evigt\ liv\ til\ salg\ 160603.docx -t markdown -o mydoc.md --wrap=none --extract-media .


end of thread, other threads:[~2019-01-01  2:19 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-12-28 12:51 DOCX to markdown. Poetry. Keep whitespace and verse structure Lars Bingchong
     [not found] ` <4dfd9f3f-ca60-40b4-9925-9618ef468000-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
2018-12-28 15:50   ` mb21
     [not found]     ` <6e2a4a4f-4cba-459c-8f15-b19726fe2496-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
2018-12-29 15:39       ` Lars Bingchong
     [not found]         ` <e3573e37-0a5a-494f-9a4a-9eff62788682-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
2018-12-31 18:09           ` BP Jonsson
     [not found]             ` <ba1957bc-8aea-2db2-b961-5ce72cf1861c-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2019-01-01  2:17               ` Lars Bingchong
2019-01-01  2:19       ` Lars Bingchong
