* Extract media from different .docx files to same directory
@ 2020-03-17 17:33 growig mail
[not found] ` <d4ebfb5f-13a8-49d8-9a54-2a40dd0d9482-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
0 siblings, 1 reply; 8+ messages in thread
From: growig mail @ 2020-03-17 17:33 UTC (permalink / raw)
To: pandoc-discuss
[-- Attachment #1.1: Type: text/plain, Size: 1134 bytes --]
Goal
I'm converting some .docx files to .md with pandoc. These .docx files contain
images that, after conversion, are placed in a directory
(markdown-repository/media/), and each image's path is referenced in the
resulting .md file.
So the goal is to have the resulting .md files link to the
proper images stored in markdown-repository/media/. For this to work, every
image under markdown-repository/media/ needs a unique name.
The problem
With each conversion, the images are overwritten by those of the latest
conversion, because pandoc doesn't track image names across runs: it creates
image1.png, image2.png, image3.png, etc. for each converted file.
How can I do this? Does pandoc have an option for this?
--
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/d4ebfb5f-13a8-49d8-9a54-2a40dd0d9482%40googlegroups.com.
[-- Attachment #1.2: Type: text/html, Size: 1626 bytes --]
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: Extract media from different .docx files to same directory
[not found] ` <d4ebfb5f-13a8-49d8-9a54-2a40dd0d9482-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2020-03-18 12:00 ` EBkysko
[not found] ` <01bc2b4a-fa92-4b39-a31c-f95d0abf2272-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
0 siblings, 1 reply; 8+ messages in thread
From: EBkysko @ 2020-03-18 12:00 UTC (permalink / raw)
To: pandoc-discuss
[-- Attachment #1.1: Type: text/plain, Size: 1464 bytes --]
I hoped "--id-prefix" would work, but no go: it only affects footnotes.
Here's a *crude* solution: a Lua filter that prefixes the image names with
the md output file name (I used the "--extract-media" option in my tests):
----------
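-- Prefix for image names: the output file name with ".md" removed.
local outputfile = PANDOC_STATE.output_file:gsub("%.md", "")

-- Prepend the prefix to the file-name part (the last path component) of s.
local function prefixName(s)
  return s:gsub("([^/]*)$", outputfile .. "_%1")
end

-- Point each image reference at the renamed file.
function Image(img)
  img.src = prefixName(img.src)
  return img
end

-- Rename every mediabag entry: insert a prefixed copy, then delete the original.
function Pandoc(doc)
  for fp, mt, contents in pandoc.mediabag.items() do
    local fpnew = prefixName(fp)
    pandoc.mediabag.insert(fpnew, mt, contents)
    pandoc.mediabag.delete(fp)
  end
end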
----------
I thought changing the file names in the mediabag would suffice, but I also
needed to change the image sources in Image.
You could use something else for the prefix (e.g. SHA-1 hashes of the image
content, which would avoid duplicates under different names).
Maybe one could simply rename a file in the mediabag, instead of
inserting a new entry and deleting the old one.
As I said, it's crude. Maybe I missed a simpler solution.
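The hash idea could be sketched roughly as follows. This is a minimal sketch that has not been run against real documents: it assumes a pandoc version that provides pandoc.utils.sha1, pandoc.mediabag.lookup, pandoc.mediabag.insert and pandoc.mediabag.delete, and it assumes that img.src is the same path the mediabag uses. The names renamed and hashedName are hypothetical helpers. Note that it uses the hash as the whole base name rather than as a prefix, so identical images end up as a single file.

```lua
-- Sketch only: name each extracted image after the SHA-1 of its contents.
-- `renamed` and `hashedName` are hypothetical helpers, not pandoc API.
local renamed = {}  -- original mediabag path -> hashed path

local function hashedName(path, contents)
  local dir = path:match("^(.*/)") or ""       -- directory part, with trailing "/"
  local ext = path:match("(%.[^./]*)$") or ""  -- extension, with leading "."
  return dir .. pandoc.utils.sha1(contents) .. ext
end

function Image(img)
  local new = renamed[img.src]
  if not new then
    local mt, contents = pandoc.mediabag.lookup(img.src)
    if not mt then
      return nil  -- not in the mediabag (e.g. a remote image): leave unchanged
    end
    new = hashedName(img.src, contents)
    if new ~= img.src then
      pandoc.mediabag.insert(new, mt, contents)
      pandoc.mediabag.delete(img.src)
    end
    renamed[img.src] = new
  end
  img.src = new
  return img
end
```

It would be used like the filter above, with "--lua-filter" together with "--extract-media". Doing the lookup and the rename inside Image avoids looping over the mediabag at all.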
* Re: Extract media from different .docx files to same directory
[not found] ` <01bc2b4a-fa92-4b39-a31c-f95d0abf2272-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2020-03-18 13:15 ` EBkysko
[not found] ` <525ed042-e47c-4fc5-b13f-91560718ba83-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
0 siblings, 1 reply; 8+ messages in thread
From: EBkysko @ 2020-03-18 13:15 UTC (permalink / raw)
To: pandoc-discuss
[-- Attachment #1.1: Type: text/plain, Size: 746 bytes --]
... although one has to ask why the "items" loop above doesn't get caught in
an infinite loop: we add to the mediabag, so the iteration should eventually
reach the new items, and so on, no?
It didn't happen in my limited tests, but it could, so the above code is
suspect unless we know that the iterator *only* iterates over the original
list of items.
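The general worry can be illustrated with a plain Lua table. This is an analogy only: it says nothing about how pandoc.mediabag.items() is implemented. For ordinary tables, the Lua reference manual states that the behaviour of next is undefined if a value is assigned to a non-existent field during traversal, so the usual defensive pattern is to snapshot the keys first and mutate afterwards. renameAll is a hypothetical helper name.

```lua
-- Plain-Lua analogy (not the mediabag itself): rename every key of a table.
local function renameAll(bag, rename)
  -- 1. Snapshot the current keys into a separate list.
  local keys = {}
  for key in pairs(bag) do
    keys[#keys + 1] = key
  end
  -- 2. Mutate the table while walking the snapshot, not the table itself.
  for _, key in ipairs(keys) do
    local new = rename(key)
    if new ~= key then
      bag[new] = bag[key]
      bag[key] = nil
    end
  end
  return bag
end

local bag = { ["image1.png"] = "A", ["image2.png"] = "B" }
renameAll(bag, function(name) return "doc_" .. name end)
for name, value in pairs(bag) do
  print(name, value)
end
```

Because the second loop only visits the keys collected in step 1, the newly added entries are never visited again, whatever the iteration order.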
* Re: Extract media from different .docx files to same directory
[not found] ` <525ed042-e47c-4fc5-b13f-91560718ba83-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2020-03-18 20:19 ` BPJ
[not found] ` <CADAJKhABAeRjQ9RAKbxo+-OPgd5NfCqiy1XP-iDCgKaTkxXYVg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
0 siblings, 1 reply; 8+ messages in thread
From: BPJ @ 2020-03-18 20:19 UTC (permalink / raw)
To: pandoc-discuss
[-- Attachment #1: Type: text/plain, Size: 1631 bytes --]
To be safe I would do this (untested):
```lua
for _, f in ipairs(pandoc.mediabag.list()) do
  local fp = f.path
  local mt, contents = pandoc.mediabag.lookup(fp)
  -- the rest as before
end
```
On Wed, 18 Mar 2020 at 14:16, EBkysko <ebkysko-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> ... although one must ask why the "items" loop above doesn't get caught in
> an infinite loop: we add to the mediabag, and the iteration should get to
> new items eventually, and so on, no?
>
> Didn't happen in my limited tests, but could, so above code suspicious,
> unless we can know if iterator *only *iterates on original list of items.
* Re: Extract media from different .docx files to same directory
[not found] ` <CADAJKhABAeRjQ9RAKbxo+-OPgd5NfCqiy1XP-iDCgKaTkxXYVg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2020-03-18 22:36 ` EBkysko
[not found] ` <6d556139-e107-4ad4-a99d-62b07bd9ad3d-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
0 siblings, 1 reply; 8+ messages in thread
From: EBkysko @ 2020-03-18 22:36 UTC (permalink / raw)
To: pandoc-discuss
[-- Attachment #1.1: Type: text/plain, Size: 1124 bytes --]
Ah, yes, good: that way we're sure we iterate over a fixed list, a snapshot
taken before the changes.
So with this, and a check on the MIME type, we get something like:
-----
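-- Prefix for image names: the output file name with ".md" removed.
local outputfile = PANDOC_STATE.output_file:gsub("%.md", "")

-- Prepend the prefix to the file-name part (the last path component) of s.
local function prefixName(s)
  return s:gsub("([^/]*)$", outputfile .. "_%1")
end

-- Point each image reference at the renamed file.
function Image(img)
  img.src = prefixName(img.src)
  return img
end

function Pandoc(doc)
  -- list() returns a table built up front, so the mediabag can be changed
  -- inside the loop without affecting the iteration.
  for _, f in ipairs(pandoc.mediabag.list()) do
    local fp = f.path
    local mt, contents = pandoc.mediabag.lookup(fp)
    -- Only rename entries whose MIME type is image/*.
    if mt:match("^[^/]*") == "image" then
      local fpnew = prefixName(fp)
      pandoc.mediabag.insert(fpnew, mt, contents)
      pandoc.mediabag.delete(fp)
    end
  end
end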
-----
* Re: Extract media from different .docx files to same directory
[not found] ` <6d556139-e107-4ad4-a99d-62b07bd9ad3d-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2020-03-18 23:41 ` BPJ
[not found] ` <CADAJKhBfF-_hkBLJ0M5AnAL4aWJMs03JDbNj3wjjD7P6_-6fcA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
0 siblings, 1 reply; 8+ messages in thread
From: BPJ @ 2020-03-18 23:41 UTC (permalink / raw)
To: pandoc-discuss
[-- Attachment #1: Type: text/plain, Size: 2139 bytes --]
Also, there is no guarantee that the extension of the output file is
`.md`. Some people use `.mkd` or `.txt`, and it may even be an `.rst` file,
so it may be better to match the extension with the tried and tested generic
pattern `%.[^%.]*$`, which matches any extension.
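A small sketch of this suggestion in plain Lua, runnable outside pandoc. stripExtension and prefixName are hypothetical helper names; inside the filter the argument would be PANDOC_STATE.output_file, which is assumed to be set (it is nil when pandoc writes to stdout). The prefixName variant here matches with `[^/]+` and uses a function as the replacement, so that a literal `%` in the prefix is not read as a capture reference; that is a defensive choice not discussed in the thread.

```lua
-- Remove the last extension, whatever it is (".md", ".txt", ".rst", ...).
local function stripExtension(name)
  -- The outer parentheses keep only gsub's first result (the string).
  return (name:gsub("%.[^%.]*$", ""))
end

-- Prepend prefix .. "_" to the file-name part (last path component) of path.
local function prefixName(path, prefix)
  return (path:gsub("[^/]+$", function(file)
    return prefix .. "_" .. file
  end))
end

local prefix = stripExtension("report.txt")
print(prefix)                                  --> report
print(prefixName("media/image1.png", prefix))  --> media/report_image1.png
```

A name without any extension is returned unchanged, and only the last extension is removed, so "archive.tar.gz" becomes "archive.tar".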
On Wed, 18 Mar 2020 at 23:37, EBkysko <ebkysko-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> ah, yes, good, that way we're sure we iterate on a fixed list, a snapshot
> before the changes.
>
> so with this, and a check on the mime type, we get something like :
>
>
> -----
> local outputfile = PANDOC_STATE.output_file:gsub("%.md", "")
>
> local function prefixName(s)
> return s:gsub("([^/]*)$", outputfile .. "_%1")
> end
>
> function Image(img)
> img.src = prefixName(img.src)
> return img
> end
>
> function Pandoc(doc)
> for _,f in ipairs(pandoc.mediabag.list()) do
> local fp = f.path
> local mt, contents = pandoc.mediabag.lookup(fp)
> if mt:match("^[^/]*") == "image" then
> local fpnew = prefixName(fp)
> pandoc.mediabag.insert(fpnew, mt, contents)
> pandoc.mediabag.delete(fp)
> end
> end
> end
> -----
* Re: Extract media from different .docx files to same directory
[not found] ` <CADAJKhBfF-_hkBLJ0M5AnAL4aWJMs03JDbNj3wjjD7P6_-6fcA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2020-03-18 23:54 ` EBkysko
[not found] ` <a74b610a-732f-4595-9b93-7773542ad4c0-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
0 siblings, 1 reply; 8+ messages in thread
From: EBkysko @ 2020-03-18 23:54 UTC (permalink / raw)
To: pandoc-discuss
[-- Attachment #1.1: Type: text/plain, Size: 551 bytes --]
True.
I personally use .txt most of the time and only used .md to follow the OP's
choice, but yes, it costs nothing to accept any extension.
* Re: Extract media from different .docx files to same directory
[not found] ` <a74b610a-732f-4595-9b93-7773542ad4c0-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2020-08-19 9:44 ` Mario Valle
0 siblings, 0 replies; 8+ messages in thread
From: Mario Valle @ 2020-08-19 9:44 UTC (permalink / raw)
To: pandoc-discuss
[-- Attachment #1.1: Type: text/plain, Size: 1182 bytes --]
In case it helps: I had the same problem and resorted to the following
script, which worked quite well:
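mkdir -p img
cnt=0
for i in *.docx
do
  # Zero-padded counter: 00, 01, ...
  k=$(printf '%02d' "$cnt")
  n="doc$k.md"
  # Extract this document's media into its own temporary directory img$k,
  # then rewrite the image paths to point at the shared img/ directory.
  pandoc -f docx --atx-headers --wrap=none --extract-media="img$k" \
    -t commonmark-smart-raw_html "$i" | \
    sed "s/img$k\/media\//img\/abcd-$k-/g" > "$n"
  # Move the extracted images into img/ under their prefixed names.
  # Only image*.jpeg files are moved; anything else is removed with img$k.
  for t in "img$k"/media/image*.jpeg
  do
    [ -e "$t" ] || continue  # no JPEG images in this document
    mv "$t" "img/abcd-$k-$(basename "$t")"
  done
  rm -rf "img$k"
  cnt=$((cnt+1))
done
cat doc*.md >abcd.md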
On Thursday, 19 March 2020 at 00:54:11 UTC+1, EBkysko wrote:
> True.
> I personally use .txt most of the time, only used .md to follow OP's
> choice, but yes, costs nothing to widen the choice of extension.