Docx extract media as original filenames

public inbox archive for pandoc-discuss@googlegroups.com

* Docx extract media as original filenames
@ 2015-11-25  9:52 Róbert Nagy
       [not found] ` <c81870bc-fd16-4d2c-95b8-5dda5937f10c-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  0 siblings, 1 reply; 4+ messages in thread
From: Róbert Nagy @ 2015-11-25  9:52 UTC (permalink / raw)
  To: pandoc-discuss



Hello everyone!

I want to convert a docx file to DocBook with --extract-media, so that I can 
upload the pictures automatically into a database.
Is there any "hack" to keep the original filename as both the extracted 
picture's filename and the imagedata fileref value?

For example...
I have a picture that was inserted into the docx file under the name 
picture_upload.png. I want it extracted into the /media folder as 
'picture_upload.png', with the DocBook imagedata fileref set to 
'picture_upload.png' or './media/picture_upload.png'.

In <file>/word/document.xml there is this block, which contains the 
original filename:
<pic:pic xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture">
    <pic:nvPicPr>
        <pic:cNvPr id="0" name="picture_upload.png"/>
        <pic:cNvPicPr/>
    </pic:nvPicPr>
    <pic:blipFill>
        <a:blip r:embed="rId21"/>
        <a:stretch>
            <a:fillRect/>
        </a:stretch>
    </pic:blipFill>
    ...


Its relationship is defined in <file>/word/_rels/document.xml.rels:
<Relationship Id="rId21" Type=
"http://schemas.openxmlformats.org/officeDocument/2006/relationships/image"
Target="media/image110.png"/>

And the docbook output is:
<imagedata fileref="./media/image110.png" />

The file's name in the './media' folder is the same as the "Target" value.
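
The lookup described above (name attribute -> r:embed id -> Target) can be sketched outside pandoc. This is only an illustration, assuming Python's standard library and the usual package layout (word/document.xml and word/_rels/document.xml.rels); the function name original_image_names is hypothetical and not part of pandoc.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace URIs used inside a .docx package.
NS = {
    "pic": "http://schemas.openxmlformats.org/drawingml/2006/picture",
    "a": "http://schemas.openxmlformats.org/drawingml/2006/main",
    "r": "http://schemas.openxmlformats.org/officeDocument/2006/relationships",
    "rel": "http://schemas.openxmlformats.org/package/2006/relationships",
}


def original_image_names(docx):
    """Return {media target: name attribute} for the pictures in a .docx.

    `docx` is a path or a binary file-like object. Hypothetical helper.
    """
    with zipfile.ZipFile(docx) as z:
        document = ET.fromstring(z.read("word/document.xml"))
        rels = ET.fromstring(z.read("word/_rels/document.xml.rels"))

    # Relationship Id -> Target, e.g. "rId21" -> "media/image110.png"
    targets = {
        rel.get("Id"): rel.get("Target")
        for rel in rels.findall("rel:Relationship", NS)
    }

    names = {}
    embed_attr = "{%s}embed" % NS["r"]
    for pic in document.iter("{%s}pic" % NS["pic"]):
        c_nv_pr = pic.find("pic:nvPicPr/pic:cNvPr", NS)
        blip = pic.find("pic:blipFill/a:blip", NS)
        if c_nv_pr is None or blip is None:
            continue
        target = targets.get(blip.get(embed_attr))
        name = c_nv_pr.get("name")
        if target and name:
            # If one image is referenced several times, keep the first name.
            names.setdefault(target, name)
    return names
```

The returned mapping could then drive a rename of the files that --extract-media wrote. The name attribute may be missing or repeated, so the result should be treated as a hint rather than a reliable key.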

Any idea will help :)
Thank you.

-- 
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To post to this group, send email to pandoc-discuss-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/c81870bc-fd16-4d2c-95b8-5dda5937f10c%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


* Re: Docx extract media as original filenames
       [not found] ` <c81870bc-fd16-4d2c-95b8-5dda5937f10c-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2015-11-25 19:49   ` Jesse Rosenthal
       [not found]     ` <m137vtuaeh.fsf-4GNroTWusrE@public.gmane.org>
  2019-10-14 19:39   ` Daniel Fuchs
  1 sibling, 1 reply; 4+ messages in thread
From: Jesse Rosenthal @ 2015-11-25 19:49 UTC (permalink / raw)
  To: Róbert Nagy, pandoc-discuss

Hello,

Róbert Nagy <rnagy.monguz-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> writes:

> For example...
> I have a picture that was inserted into the docx file under the name 
> picture_upload.png. I want it extracted into the /media folder as 
> 'picture_upload.png', with the DocBook imagedata fileref set to 
> 'picture_upload.png' or './media/picture_upload.png'.

Here's the problem. Let's say you have two files named "pic.jpg":
"~/pics/pic.jpg" and "~/Downloads/pic.jpg." They would *both* show as
having "pic.jpg" in their name field. Or maybe you added one, and then
added another later on a different computer. In any case, that name
attribute isn't dependable. I'm not sure if it's even required. Without
looking at the spec, I'm also not sure whether the characters in it are
guaranteed to be legal for a filename. (And that's not even going into
the issue of going from a case-sensitive FS to a case-insensitive one.)

I suppose it's possible to do it by sanitizing the names and then adding
numerical suffixes if there's a collision. Seems like it might be a can of
worms, though, and it would probably create inconsistencies with other
input formats.
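
As a rough illustration of the "sanitize, then add numerical suffixes" idea, here is a minimal Python sketch. It is not pandoc behavior; the function name safe_unique_name, the character whitelist, and the "-1", "-2" suffix style are all assumptions made for the example.

```python
import os
import re


def safe_unique_name(name, taken, fallback="image"):
    """Sanitize `name` and make it unique against the set `taken`.

    Names are compared case-insensitively so the result is also safe on
    case-insensitive filesystems. `taken` is updated in place.
    """
    # Drop any directory part, then replace everything outside a
    # conservative whitelist with "_".
    base = name.replace("\\", "/").rsplit("/", 1)[-1]
    base = re.sub(r"[^A-Za-z0-9._-]", "_", base).strip("._")
    if not base:
        # The name attribute may be empty or consist only of odd characters.
        base = fallback
    stem, ext = os.path.splitext(base)

    candidate, n = base, 1
    while candidate.lower() in taken:
        candidate = f"{stem}-{n}{ext}"
        n += 1
    taken.add(candidate.lower())
    return candidate
```

Calling it repeatedly with the same `taken` set yields distinct names for colliding inputs, which is exactly where the inconsistency with other input formats would come from: the final name depends on the order in which images are encountered.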

Best,
Jesse


* Re: Docx extract media as original filenames
       [not found]     ` <m137vtuaeh.fsf-4GNroTWusrE@public.gmane.org>
@ 2015-11-26  8:10       ` Róbert Nagy
  0 siblings, 0 replies; 4+ messages in thread
From: Róbert Nagy @ 2015-11-26  8:10 UTC (permalink / raw)
  To: pandoc-discuss; +Cc: rnagy.monguz-Re5JQEeQqe8AvxtiuMwx3w


Thank you for the reply!

The main problem is with the value of the DocBook fileref.
Here is the complete workflow I want to implement:
1. Upload a batch of docx files and convert each one to DocBook, because 
the main content format in my database is DocBook.
2. Save the converted file, together with its images, into the database.
3. Convert the DocBook to HTML again when the user wants to read it in a 
browser.

Because the images are in the same database, the final <img> tag's "src" 
attribute will, with a little modification, be: 
http://.../rest/<file's identifier>.
This makes it possible to show the complete HTML content in the main 
web application without copying the source images into it. But if I can't 
change the extracted filenames (maybe with a prefix), there are going to be 
a lot of files in the database named just "image1". The second problem is 
that if I want to update a picture, I won't know which one it is.
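
One way to get prefixed names without changing pandoc would be to post-process the DocBook output. The following is a minimal sketch, assuming Python's standard library; rewrite_filerefs, add_prefix, and the "doc42" prefix are hypothetical names chosen for the example, and the extracted files would still have to be renamed on disk to match.

```python
import xml.etree.ElementTree as ET


def add_prefix(ref, prefix):
    """Turn './media/image110.png' into './media/<prefix>_image110.png'."""
    folder, sep, filename = ref.rpartition("/")
    return f"{folder}{sep}{prefix}_{filename}"


def rewrite_filerefs(docbook_xml, make_ref):
    """Return the DocBook XML with every imagedata fileref passed
    through the callable `make_ref`."""
    root = ET.fromstring(docbook_xml)
    for el in root.iter():
        # Match imagedata with or without a DocBook 5 namespace.
        if isinstance(el.tag, str) and el.tag.rsplit("}", 1)[-1] == "imagedata":
            ref = el.get("fileref")
            if ref is not None:
                el.set("fileref", make_ref(ref))
    return ET.tostring(root, encoding="unicode")


# Usage: prefix every image with a (hypothetical) document identifier.
fragment = '<para><imagedata fileref="./media/image110.png" /></para>'
print(rewrite_filerefs(fragment, lambda ref: add_prefix(ref, "doc42")))
```

Note that ElementTree does not preserve a DOCTYPE declaration and may rename namespace prefixes when serializing, so this is only suitable as a starting point.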

On Wednesday, November 25, 2015 at 20:49:45 UTC+1, Jesse Rosenthal wrote:
>
> Róbert Nagy <rnagy....-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> writes: 
>
> > For example... 
> > I have a picture that was inserted into the docx file under the name 
> > picture_upload.png. I want it extracted into the /media folder as 
> > 'picture_upload.png', with the DocBook imagedata fileref set to 
> > 'picture_upload.png' or './media/picture_upload.png'. 
>
> Here's the problem. Let's say you have two files named "pic.jpg": 
> "~/pics/pic.jpg" and "~/Downloads/pic.jpg." They would *both* show as 
> having "pic.jpg" in their name field. Or maybe you added one, and then 
> added another later on a different computer. In any case, that name 
> attribute isn't dependable.
> ...
>
> Best, 
> Jesse 


* Re: Docx extract media as original filenames
       [not found] ` <c81870bc-fd16-4d2c-95b8-5dda5937f10c-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  2015-11-25 19:49   ` Jesse Rosenthal
@ 2019-10-14 19:39   ` Daniel Fuchs
  1 sibling, 0 replies; 4+ messages in thread
From: Daniel Fuchs @ 2019-10-14 19:39 UTC (permalink / raw)
  To: pandoc-discuss



Guys,

Is there any update on prefixing the image names or changing them somehow?

Thanks,

Daniel


