Extract media from different .docx files to same directory

public inbox archive for pandoc-discuss@googlegroups.com

* Extract media from different .docx files to same directory
@ 2020-03-17 17:33 growig mail
       [not found] ` <d4ebfb5f-13a8-49d8-9a54-2a40dd0d9482-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  0 siblings, 1 reply; 8+ messages in thread
From: growig mail @ 2020-03-17 17:33 UTC (permalink / raw)
  To: pandoc-discuss

Goal

I'm converting some .docx files to .md with pandoc. The .docx files contain
images that, after conversion, are placed in a directory
(markdown-repository/media/), and each image's path is referenced in the
resulting .md file.

So the goal is to have the resulting .md files link to the proper images
stored in markdown-repository/media/. For this to work, every image under
markdown-repository/media/ needs a unique name.

The problem

With each conversion, the images are overwritten by those of the latest
conversion, because pandoc doesn't track image names across runs: it creates
image1.png, image2.png, image3.png, etc. for every converted file.

How can I do this? Does pandoc have an option for this?

-- 
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/d4ebfb5f-13a8-49d8-9a54-2a40dd0d9482%40googlegroups.com.

* Re: Extract media from different .docx files to same directory
       [not found] ` <d4ebfb5f-13a8-49d8-9a54-2a40dd0d9482-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2020-03-18 12:00   ` EBkysko
       [not found]     ` <01bc2b4a-fa92-4b39-a31c-f95d0abf2272-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  0 siblings, 1 reply; 8+ messages in thread
From: EBkysko @ 2020-03-18 12:00 UTC (permalink / raw)
  To: pandoc-discuss


I hoped "--id-prefix" would work, but no luck: in Markdown output it only
affects footnote numbers.

Here's a *crude* solution: a Lua filter that prefixes the image names with the
.md output file name (I used the "--extract-media" option in my tests):

----------
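-- prefix: output file name without the ".md" extension
local outputfile = PANDOC_STATE.output_file:gsub("%.md", "")

-- put the prefix in front of the last path component (the file name)
local function prefixName(s)
  return s:gsub("([^/]*)$", outputfile .. "_%1")
end

-- rewrite the image paths in the document ...
function Image(img)
  img.src = prefixName(img.src)
  return img
end

-- ... and rename the corresponding entries in the mediabag
function Pandoc(doc)
  for fp, mt, contents in pandoc.mediabag.items() do
    local fpnew = prefixName(fp)
    pandoc.mediabag.insert(fpnew, mt, contents)
    pandoc.mediabag.delete(fp)
  end
end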
----------

I thought renaming the files in the mediabag would suffice, but I also
needed to change the Image elements.
You could use something else for the prefix (e.g. SHA-1 hashes of the image
content, which would avoid storing duplicates under different names).
Maybe one could simply rename a file in the mediabag, instead of
inserting a new entry and deleting the old one.
As I said, it's crude. Maybe I missed a simpler solution.
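
The content-hash idea mentioned above could look roughly like the sketch below. It is untested against a real pandoc run and is not the filter posted in this message: the helper name hashed_name is made up for illustration, and the code assumes pandoc's default traversal order, in which Image functions run before the Pandoc function, so the mediabag still holds the original names when the image paths are rewritten.

```lua
-- Build a content-based file name: <directory><sha1 of contents><extension>.
-- Identical images get identical names, so duplicates collapse into one file.
-- (hashed_name is an illustrative helper, not a pandoc function.)
local function hashed_name(path, contents)
  local dir = path:match("^(.*/)") or ""        -- keep e.g. "media/"
  local ext = path:match("(%.[^./]+)$") or ""   -- keep e.g. ".png"
  return dir .. pandoc.utils.sha1(contents) .. ext
end

-- Image elements are visited before Pandoc (assumed default traversal),
-- so the original name can still be looked up in the mediabag here.
function Image(img)
  local mt, contents = pandoc.mediabag.lookup(img.src)
  if contents then
    img.src = hashed_name(img.src, contents)
  end
  return img
end

function Pandoc(doc)
  -- list() returns a table, so the loop is not affected by the changes below
  for _, f in ipairs(pandoc.mediabag.list()) do
    local mt, contents = pandoc.mediabag.lookup(f.path)
    -- delete first, so an entry whose name is already its hash survives
    pandoc.mediabag.delete(f.path)
    pandoc.mediabag.insert(hashed_name(f.path, contents), mt, contents)
  end
  return doc
end
```

Because the name depends only on the contents, the same image extracted from two different .docx files would end up as a single file in the shared media directory.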

* Re: Extract media from different .docx files to same directory
       [not found]     ` <01bc2b4a-fa92-4b39-a31c-f95d0abf2272-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2020-03-18 13:15       ` EBkysko
       [not found]         ` <525ed042-e47c-4fc5-b13f-91560718ba83-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  0 siblings, 1 reply; 8+ messages in thread
From: EBkysko @ 2020-03-18 13:15 UTC (permalink / raw)
  To: pandoc-discuss


... although one must ask why the "items" loop above doesn't end up in
an infinite loop: we add to the mediabag while iterating, so the iteration
should eventually reach the new items, and so on, no?

It didn't happen in my limited tests, but it could, so the code above is
suspect unless we know that the iterator *only* iterates over the original
list of items.

* Re: Extract media from different .docx files to same directory
       [not found]         ` <525ed042-e47c-4fc5-b13f-91560718ba83-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2020-03-18 20:19           ` BPJ
       [not found]             ` <CADAJKhABAeRjQ9RAKbxo+-OPgd5NfCqiy1XP-iDCgKaTkxXYVg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 8+ messages in thread
From: BPJ @ 2020-03-18 20:19 UTC (permalink / raw)
  To: pandoc-discuss

To be safe I would do this (untested):

```lua
-- list() returns a table, so the loop runs over a fixed snapshot
for _, f in ipairs(pandoc.mediabag.list()) do
  local fp = f.path
  local mt, contents = pandoc.mediabag.lookup(fp)
  -- the rest as before
end
```


On Wed, 18 March 2020 at 14:16, EBkysko <ebkysko-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:

> ... although one must ask why the "items" loop above doesn't end up in
> an infinite loop: we add to the mediabag while iterating, so the iteration
> should eventually reach the new items, and so on, no?
>
> It didn't happen in my limited tests, but it could, so the code above is
> suspect unless we know that the iterator *only* iterates over the original
> list of items.

* Re: Extract media from different .docx files to same directory
       [not found]             ` <CADAJKhABAeRjQ9RAKbxo+-OPgd5NfCqiy1XP-iDCgKaTkxXYVg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2020-03-18 22:36               ` EBkysko
       [not found]                 ` <6d556139-e107-4ad4-a99d-62b07bd9ad3d-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  0 siblings, 1 reply; 8+ messages in thread
From: EBkysko @ 2020-03-18 22:36 UTC (permalink / raw)
  To: pandoc-discuss


Ah, yes, good: that way we're sure we iterate over a fixed list, a snapshot
taken before the changes.

So with this, and a check on the MIME type, we get something like:


-----
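-- prefix: output file name without the ".md" extension
local outputfile = PANDOC_STATE.output_file:gsub("%.md", "")

-- put the prefix in front of the last path component (the file name)
local function prefixName(s)
  return s:gsub("([^/]*)$", outputfile .. "_%1")
end

-- rewrite the image paths in the document
function Image(img)
  img.src = prefixName(img.src)
  return img
end

-- rename the image entries in the mediabag to match
function Pandoc(doc)
  -- list() gives a fixed snapshot, so inserting and deleting below is safe
  for _,f in ipairs(pandoc.mediabag.list()) do
    local fp = f.path
    local mt, contents = pandoc.mediabag.lookup(fp)
    if mt:match("^[^/]*") == "image" then  -- e.g. "image/png"
      local fpnew = prefixName(fp)
      pandoc.mediabag.insert(fpnew, mt, contents)
      pandoc.mediabag.delete(fp)
    end
  end
end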
-----


* Re: Extract media from different .docx files to same directory
       [not found]                 ` <6d556139-e107-4ad4-a99d-62b07bd9ad3d-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2020-03-18 23:41                   ` BPJ
       [not found]                     ` <CADAJKhBfF-_hkBLJ0M5AnAL4aWJMs03JDbNj3wjjD7P6_-6fcA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 8+ messages in thread
From: BPJ @ 2020-03-18 23:41 UTC (permalink / raw)
  To: pandoc-discuss

Also, there is no guarantee that the extension of the output file is
`.md`. Some people use `.mkd` or `.txt`, and it may even be an `.rst` file,
so it may be better to use the tried and tested generic pattern `%.[^%.]*$`,
which matches any file extension.
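
A small plain-Lua illustration of what that pattern does may help. The helper name strip_extension is only for illustration; the extra parentheses around the gsub call drop its second return value (the replacement count).

```lua
-- Strip the last extension from a file name, whatever it is.
local function strip_extension(name)
  -- "%.[^%.]*$": a literal dot followed by non-dot characters up to the end
  return (name:gsub("%.[^%.]*$", ""))
end

print(strip_extension("report.md"))     -- report
print(strip_extension("report.mkd"))    -- report
print(strip_extension("notes.v2.rst"))  -- notes.v2
print(strip_extension("README"))        -- README
```

Note that `[^%.]` also matches a slash, so on a full path whose file name has no extension the pattern can hit a dot in a directory name; it is safest to apply it to the base name only.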


On Wed, 18 March 2020 at 23:37, EBkysko <ebkysko-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:

> ah, yes, good, that way we're sure we iterate on a fixed list, a snapshot
> before the changes.
>
> so with this, and a check on the mime type, we get something like :
>
>
> -----
> local outputfile = PANDOC_STATE.output_file:gsub("%.md", "")
>
> local function prefixName(s)
>   return s:gsub("([^/]*)$", outputfile .. "_%1")
> end
>
> function Image(img)
>   img.src = prefixName(img.src)
>   return img
> end
>
> function Pandoc(doc)
>   for _,f in ipairs(pandoc.mediabag.list()) do
>     local fp = f.path
>     local mt, contents = pandoc.mediabag.lookup(fp)
>     if mt:match("^[^/]*") == "image" then
>       local fpnew = prefixName(fp)
>       pandoc.mediabag.insert(fpnew, mt, contents)
>       pandoc.mediabag.delete(fp)
>     end
>   end
> end
> -----

* Re: Extract media from different .docx files to same directory
       [not found]                     ` <CADAJKhBfF-_hkBLJ0M5AnAL4aWJMs03JDbNj3wjjD7P6_-6fcA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2020-03-18 23:54                       ` EBkysko
       [not found]                         ` <a74b610a-732f-4595-9b93-7773542ad4c0-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  0 siblings, 1 reply; 8+ messages in thread
From: EBkysko @ 2020-03-18 23:54 UTC (permalink / raw)
  To: pandoc-discuss


True.
I personally use .txt most of the time and only used .md to follow the OP's
choice, but yes, it costs nothing to accept any extension.
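
Putting the suggestions from this thread together, the filter might end up looking like the sketch below. It has not been run against a real pandoc conversion. The names prefix, prefix_name and is_image are illustrative, the fallback prefix "output" is an arbitrary placeholder, and the Image function relies on pandoc's default traversal order, in which Image functions run before the Pandoc function.

```lua
-- Prefix derived from the output file name, with directory and extension removed.
-- PANDOC_STATE.output_file is nil when pandoc writes to stdout, hence the
-- placeholder fallback (an assumption of this sketch).
local output_file = PANDOC_STATE and PANDOC_STATE.output_file or "output"
local prefix = output_file:gsub("^.*[/\\]", ""):gsub("%.[^%.]*$", "")

local function prefix_name(path)
  -- "+" rather than "*" so the pattern never matches an empty string; the
  -- function replacement keeps a "%" in the prefix from acting as an escape
  return (path:gsub("[^/]+$", function(name) return prefix .. "_" .. name end))
end

local function is_image(mime_type)
  return mime_type ~= nil and mime_type:match("^image/") ~= nil
end

-- Only rewrite images that were extracted into the mediabag; external
-- URLs and other paths are left alone.
function Image(img)
  local mt = pandoc.mediabag.lookup(img.src)
  if is_image(mt) then
    img.src = prefix_name(img.src)
  end
  return img
end

function Pandoc(doc)
  -- iterate over the snapshot from list(), then rename entry by entry
  for _, f in ipairs(pandoc.mediabag.list()) do
    local mt, contents = pandoc.mediabag.lookup(f.path)
    if is_image(mt) then
      pandoc.mediabag.delete(f.path)
      pandoc.mediabag.insert(prefix_name(f.path), mt, contents)
    end
  end
  return doc
end
```

It would be invoked along the lines of `pandoc report.docx -o report.md --extract-media=markdown-repository --lua-filter=prefix-media.lua` (the filter file name is hypothetical), which should leave the images as markdown-repository/media/report_image1.png and so on.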

* Re: Extract media from different .docx files to same directory
       [not found]                         ` <a74b610a-732f-4595-9b93-7773542ad4c0-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2020-08-19  9:44                           ` Mario Valle
  0 siblings, 0 replies; 8+ messages in thread
From: Mario Valle @ 2020-08-19  9:44 UTC (permalink / raw)
  To: pandoc-discuss


In case this is of any help: I had the same problem and resorted to the
following script, which worked quite well:

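mkdir -p img
cnt=0
for i in *.docx
do
    # two-digit counter used as the per-document prefix
    k=$(printf "%02d" "$cnt")
    n="doc$k.md"
    # extract media into a per-document directory, then point links at img/abcd-NN-*
    # (newer pandoc versions spell --atx-headers as --markdown-headings=atx)
    pandoc -f docx --atx-headers --wrap=none --extract-media="img$k" \
        -t commonmark-smart-raw_html "$i" | \
    sed "s/img$k\/media\//img\/abcd-$k-/g" > "$n"

    # move the extracted images into the shared directory under unique names
    for t in "img$k"/media/image*.jpeg
    do
        [ -e "$t" ] || continue   # no .jpeg images in this document
        mv "$t" "img/abcd-$k-$(basename "$t")"
    done
    rm -rf "img$k"

    cnt=$((cnt+1))
done
cat doc*.md >abcd.md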

On Thursday, 19 March 2020 at 00:54:11 UTC+1, EBkysko wrote:

> True.
> I personally use .txt most of the time, only used .md to follow OP's 
> choice, but yes, costs nothing to widen the choice of extension.