public inbox archive for pandoc-discuss@googlegroups.com
* Cannot: (i) extract images from the input docx files; (ii) convert 2 docx input files into 1 html
@ 2015-01-19  2:29 Tl Yim
       [not found] ` <4e05f6a9-4c84-49f2-bbf3-f0b5c5479a59-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  0 siblings, 1 reply; 11+ messages in thread
From: Tl Yim @ 2015-01-19  2:29 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw


[-- Attachment #1.1: Type: text/plain, Size: 1375 bytes --]

I would very much appreciate your comments on what I did wrong below.

I am trying to use the following command to
(i) extract images from the docx input files (attached);
(ii) combine the 2 docx input files and convert them into 1 html output file

pandoc -s -S --extract-media=D:/img -f docx -t html -o EMnFraud.html EMnFraud.docx EMnFraud_outline.docx

I also tried the equivalent form below, without success:
(i) no images are extracted, and
(ii) the html output file contains only the contents of the first docx input
file

pandoc -s -S --extract-media=D:/img -f docx -t html EMnFraud.docx EMnFraud_outline.docx > EMnFraud.html

I use Windows XP and pandoc 1.13.2

Thanks in advance. 
(I tried for a whole day without success, and could not find any relevant
information with a Google search.)


-- 
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To post to this group, send email to pandoc-discuss-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/4e05f6a9-4c84-49f2-bbf3-f0b5c5479a59%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

[-- Attachment #1.2: Type: text/html, Size: 2109 bytes --]

[-- Attachment #2: EMnFraud.docx --]
[-- Type: application/vnd.openxmlformats-officedocument.wordprocessingml.document, Size: 1253522 bytes --]

[-- Attachment #3: EMnFraud_outline.docx --]
[-- Type: application/vnd.openxmlformats-officedocument.wordprocessingml.document, Size: 68198 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Cannot: (i) extract images from the input docx files; (ii) convert 2 docx input files into 1 html
       [not found] ` <4e05f6a9-4c84-49f2-bbf3-f0b5c5479a59-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2015-01-19  2:38   ` Matthew Pickering
       [not found]     ` <CALuQ0m9nkPNtSDLDCSfhzWsGy7Kg77L5DSpUu4eFq5AF1od8_A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 11+ messages in thread
From: Matthew Pickering @ 2015-01-19  2:38 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

Hi,

Sorry I made you waste a day. The error message should definitely be clearer.

Unlike with text files, you can currently specify only one binary input
file (such as a docx) on the command line. This is definitely something
that could be implemented in the future, though!

Matt
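
A possible workaround, until multiple binary inputs are supported, is to
run pandoc once per docx file and join the results. The Python sketch below
is only an illustration: the helper names `pandoc_command` and
`convert_and_join`, and the choice of one media sub-directory per input
file, are assumptions and not part of pandoc. It uses only options already
shown in this thread (`-f`, `-t`, `--extract-media`) and leaves out `-s`, so
the joined output is a bare HTML fragment rather than a standalone page.

```python
import subprocess
from pathlib import Path


def pandoc_command(docx_path, media_dir):
    """Build the argument list for converting ONE docx file to an HTML fragment."""
    return [
        "pandoc",
        "-f", "docx",
        "-t", "html",
        "--extract-media=" + str(media_dir),
        str(docx_path),
    ]


def convert_and_join(docx_paths, media_root, output_html):
    """Run pandoc once per input file and concatenate the HTML fragments."""
    fragments = []
    for docx_path in docx_paths:
        # Hypothetical layout: one media sub-directory per input file,
        # so that images from different documents do not overwrite each other.
        media_dir = Path(media_root) / Path(docx_path).stem
        result = subprocess.run(
            pandoc_command(docx_path, media_dir),
            check=True,            # raise if pandoc exits with an error
            capture_output=True,   # pandoc writes the HTML to stdout
        )
        fragments.append(result.stdout.decode("utf-8"))
    Path(output_html).write_text("\n".join(fragments), encoding="utf-8")
```

For the files in this thread the call would be
`convert_and_join(["EMnFraud.docx", "EMnFraud_outline.docx"], "D:/img", "EMnFraud.html")`.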


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Cannot: (i) extract images from the input docx files; (ii) convert 2 docx input files into 1 html
       [not found]     ` <CALuQ0m9nkPNtSDLDCSfhzWsGy7Kg77L5DSpUu4eFq5AF1od8_A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-01-19 13:29       ` Andrew Yim
       [not found]         ` <CAFbbbDMWd_0vfWtqhOfFLZt3neA7P6w-zNfZ=Je6Ks0u9rjsXw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 11+ messages in thread
From: Andrew Yim @ 2015-01-19 13:29 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

[-- Attachment #1: Type: text/plain, Size: 3791 bytes --]

Thanks, Matt, for confirming that converting multiple docx input files in
one go is not supported yet.

Do you have any idea why pandoc does not extract images from my first
docx input file, EMnFraud.docx?
(I tried that file on its own, without a second docx input file, but still
couldn't get its images extracted.)

Did I specify the extract-media directory incorrectly for the Windows
environment? (I also tried "D:/img" with quotation marks and D:\img, without
success.)

Andrew


[-- Attachment #2: Type: text/html, Size: 5901 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Cannot: (i) extract images from the input docx files; (ii) convert 2 docx input files into 1 html
       [not found]         ` <CAFbbbDMWd_0vfWtqhOfFLZt3neA7P6w-zNfZ=Je6Ks0u9rjsXw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-01-19 13:45           ` Matthew Pickering
  2015-01-19 15:22           ` Jesse Rosenthal
  1 sibling, 0 replies; 11+ messages in thread
From: Matthew Pickering @ 2015-01-19 13:45 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw; +Cc: Jesse Rosenthal

I'm not too sure, but I have CC'd the author of that part of pandoc and
included the relevant XML below.

https://gist.github.com/mpickering/b93b9764e275ff8ff4a4


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Cannot: (i) extract images from the input docx files; (ii) convert 2 docx input files into 1 html
       [not found]         ` <CAFbbbDMWd_0vfWtqhOfFLZt3neA7P6w-zNfZ=Je6Ks0u9rjsXw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2015-01-19 13:45           ` Matthew Pickering
@ 2015-01-19 15:22           ` Jesse Rosenthal
       [not found]             ` <87k30i7q2d.fsf-4GNroTWusrE@public.gmane.org>
  1 sibling, 1 reply; 11+ messages in thread
From: Jesse Rosenthal @ 2015-01-19 15:22 UTC (permalink / raw)
  To: Andrew Yim, pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw; +Cc: Matthew Pickering

Hi Andrew,

> Do you have any clue for why pandoc does not extract images from my first
> docx input file EMnFraud.docx ?
> (I tried that independently without a second docx input file but still
> couldn't get the images in the file extracted.)

The issue here is that the image data in this document is under the "v:"
namespace -- i.e., it's not in OOXML, but rather in VML (Vector Markup
Language). This is a deprecated format, and isn't used in more recent
versions of Word (which use DrawingML for images, and OMML for
equations). It was, I think, Microsoft's attempt to do their NIH MS
thing with SVG for IE. It thankfully didn't catch on, and they gave up on
it.

I'll poke around to see if there's a simple way to rescue
`<v:imagedata>` and do the right thing with it -- after all, we don't
parse all of DrawingML, we just get the images out. But since it's tied
up in a whole deprecated XML language, that might end up introducing
other problems. I'd think the safest thing would be to add something to
the README saying that VML features are deprecated and are not
supported. It would be nice if we could find out if there's a simple way
to convert them to more modern formats.
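
One way to check whether a given docx is affected is to look inside it: a
docx is a zip archive, and the main document part is normally
`word/document.xml`. The Python sketch below counts image references in the
VML namespace against those in DrawingML. It is only an illustration and not
how pandoc does it: the function name `count_image_references` is
hypothetical, the namespace URIs are the ones Word commonly writes, the VML
element name is compared case-insensitively as a precaution, and only the
main document part is examined (headers, footers and notes are ignored).

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace URIs as commonly written by Word (assumed, not taken from this thread).
VML_NS = "urn:schemas-microsoft-com:vml"
DRAWINGML_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"


def count_image_references(docx_file):
    """Return (vml_count, drawingml_count) for word/document.xml in a docx.

    docx_file may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    vml_count = 0
    drawingml_count = 0
    for element in root.iter():
        # ElementTree spells qualified names as "{namespace-uri}localname".
        if element.tag.lower() == "{" + VML_NS + "}imagedata":
            vml_count += 1          # old-style VML image reference
        elif element.tag == "{" + DRAWINGML_NS + "}blip":
            drawingml_count += 1    # DrawingML image reference
    return vml_count, drawingml_count
```

A result with a non-zero first number and a zero second number would suggest
that the images in the document are referenced through VML only.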

Andrew: Would you mind sending a minimal test file with an image that
doesn't come out? It will help me see if supporting vml images is a
possibility.

Best,
Jesse


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Cannot: (i) extract images from the input docx files; (ii) convert 2 docx input files into 1 html
       [not found]             ` <87k30i7q2d.fsf-4GNroTWusrE@public.gmane.org>
@ 2015-01-19 16:39               ` Jesse Rosenthal
       [not found]                 ` <87ppaad8t1.fsf-4GNroTWusrE@public.gmane.org>
  0 siblings, 1 reply; 11+ messages in thread
From: Jesse Rosenthal @ 2015-01-19 16:39 UTC (permalink / raw)
  To: Andrew Yim, pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw; +Cc: Matthew Pickering

Okay -- I implemented this change to read images from files written by
older versions of Word. If Andrew (or anyone else with an older Word that
uses VML) could send a minimal file to use as a test case, I'll push it.
(Unfortunately, I can't make the test file myself, since I don't have
access to a version of Word that uses VML.)


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Cannot: (i) extract images from the input docx files; (ii) convert 2 docx input files into 1 html
       [not found]                 ` <87ppaad8t1.fsf-4GNroTWusrE@public.gmane.org>
@ 2015-01-19 22:48                   ` Andrew Yim
       [not found]                     ` <CAFbbbDPWanTwpD6uXU=fqLasQkM36iAEDj7sbLN7_uih8H+1Kw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 11+ messages in thread
From: Andrew Yim @ 2015-01-19 22:48 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw


[-- Attachment #1.1: Type: text/plain, Size: 2014 bytes --]

Hi Jesse

Thanks. I am not sure whether the attached docx file can serve as a test
case. It has 7 pages with 3 images, taken from the file in my original post.
It was saved in Word 2007 .docx format from my Word 2003. The pdf shows what's
inside the file. Would the extracted images be saved as .png files, or in
some other format?

Thanks again.


[-- Attachment #1.2: Type: text/html, Size: 3308 bytes --]

[-- Attachment #2: SevenPages3Images.pdf --]
[-- Type: application/pdf, Size: 412636 bytes --]

[-- Attachment #3: SevenPages3Images.docx --]
[-- Type: application/vnd.openxmlformats-officedocument.wordprocessingml.document, Size: 618170 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Cannot: (i) extract images from the input docx files; (ii) convert 2 docx input files into 1 html
       [not found]                     ` <CAFbbbDPWanTwpD6uXU=fqLasQkM36iAEDj7sbLN7_uih8H+1Kw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-01-19 23:56                       ` Jesse Rosenthal
       [not found]                         ` <87twzmcojp.fsf-4GNroTWusrE@public.gmane.org>
  0 siblings, 1 reply; 11+ messages in thread
From: Jesse Rosenthal @ 2015-01-19 23:56 UTC (permalink / raw)
  To: Andrew Yim, pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

Hi Andrew,

This is actually a bit heavy for a test case, since it has to become
part of the permanent repo -- and because it also exhibits a lot of
behavior that might lead to failing tests for different reasons. What
would be great is something with one line of text and one small image,
that the current pandoc doesn't convert properly.

The saved images would be in their original format.
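
Since a docx file is a zip archive, the stored image formats can be
inspected directly before running pandoc. The following Python sketch
assumes the usual Word layout, in which embedded images sit under
`word/media/`; the function name `list_embedded_media` is hypothetical and
the sketch does not call pandoc at all.

```python
import zipfile
from pathlib import PurePosixPath


def list_embedded_media(docx_file):
    """Return sorted (archive_name, extension) pairs for files under word/media/.

    docx_file may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_file) as archive:
        names = [
            name for name in archive.namelist()
            if name.startswith("word/media/") and not name.endswith("/")
        ]
    # Zip member names always use forward slashes, hence PurePosixPath.
    return sorted((name, PurePosixPath(name).suffix.lower()) for name in names)
```

Whatever extensions show up in the result are the formats the extracted
files would have, going by the statement above.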

Best,
Jesse


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Cannot: (i) extract images from the input docx files; (ii) convert 2 docx input files into 1 html
       [not found]                         ` <87twzmcojp.fsf-4GNroTWusrE@public.gmane.org>
@ 2015-01-20 12:24                           ` Andrew Yim
       [not found]                             ` <CAFbbbDMo7OOWCt2-L9euCGmWqU_LDWMYPACwvZCeaZ9JRWiUWg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 11+ messages in thread
From: Andrew Yim @ 2015-01-20 12:24 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw


[-- Attachment #1.1: Type: text/plain, Size: 4073 bytes --]

Hi Jesse

Would this one be ok? Thanks.

Andrew


[-- Attachment #1.2: Type: text/html, Size: 6676 bytes --]

[-- Attachment #2: OnePages1Image.pdf --]
[-- Type: application/pdf, Size: 20751 bytes --]

[-- Attachment #3: OnePages1Image.docx --]
[-- Type: application/vnd.openxmlformats-officedocument.wordprocessingml.document, Size: 37670 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Cannot: (i) extract images from the input docx files; (ii) convert 2 docx input files into 1 html
       [not found]                             ` <CAFbbbDMo7OOWCt2-L9euCGmWqU_LDWMYPACwvZCeaZ9JRWiUWg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-01-21 18:47                               ` Jesse Rosenthal
       [not found]                                 ` <87zj9cc6oh.fsf-4GNroTWusrE@public.gmane.org>
  0 siblings, 1 reply; 11+ messages in thread
From: Jesse Rosenthal @ 2015-01-21 18:47 UTC (permalink / raw)
  To: Andrew Yim, pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

Dear Andrew,

I cut some of the text out of it, and used it as a test. The fix has now
been pushed.

Note though that it will be quite easy to get weird results using this
sort of image, since they're often placed absolutely on the page, and
anchored to text in weird ways. I'll try to solve the problems as they
come up.

Best,
Jesse



* Re: Cannot: (i) extract images from the input docx files; (ii) convert 2 docx input files into 1 html
       [not found]                                 ` <87zj9cc6oh.fsf-4GNroTWusrE@public.gmane.org>
@ 2015-01-21 22:45                                   ` Andrew Yim
  0 siblings, 0 replies; 11+ messages in thread
From: Andrew Yim @ 2015-01-21 22:45 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

Thanks and appreciated. - Andrew
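
A note for readers of the archive: the fix quoted below concerns images that older versions of Word store as VML markup. As a rough sketch only, and not pandoc's actual implementation, the following Python reports which kind of image markup a .docx file contains. It assumes the conventional namespace prefixes (`v:imagedata` for VML, `a:blip` for DrawingML) appear literally in word/document.xml; the function name `image_markup_kinds` is made up for illustration.

```python
import io
import zipfile

# Assumption: the document uses the conventional prefixes "v:" (VML)
# and "a:" (DrawingML). A file declaring other prefixes would be missed.
VML_MARKER = b"<v:imagedata"
DRAWINGML_MARKER = b"<a:blip"


def image_markup_kinds(docx_bytes):
    """Return a set naming the image markup found in a .docx file.

    The result may contain "vml", "drawingml", both, or neither.
    Hypothetical helper for illustration; it does a plain byte search
    and does not parse the XML.
    """
    # A .docx file is a zip archive; the main body is word/document.xml.
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        body = archive.read("word/document.xml")

    kinds = set()
    if VML_MARKER in body:
        kinds.add("vml")
    if DRAWINGML_MARKER in body:
        kinds.add("drawingml")
    return kinds
```

To try it on a real file, read the file in binary mode and pass the bytes, for example `image_markup_kinds(open("EMnFraud.docx", "rb").read())`. A result containing "vml" suggests the file is of the kind discussed in this thread.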



> On 21 Jan 2015, at 18:47, Jesse Rosenthal <jrosenthal-4GNroTWusrE@public.gmane.org> wrote:
> 
> Dear Andrew,
> 
> I cut some of the text out of it, and used it as a test. The fix has now
> been pushed.
> 
> Note, though, that it will be quite easy to get weird results with these
> sorts of images, since they're often placed absolutely on the page and
> anchored to text in weird ways. I'll try to solve the problems as they
> come up.
> 
> Best,
> Jesse
> 
> Andrew Yim <andrew.yim-QDVy5qH0ceL2fBVCVOL8/A@public.gmane.org> writes:
> 
>> Hi Jesse
>> 
>> Would this one be ok? Thanks.
>> 
>> Andrew
>> 
>> On Mon, Jan 19, 2015 at 11:56 PM, Jesse Rosenthal <jrosenthal-4GNroTWusrE@public.gmane.org>
>> wrote:
>> 
>>> Hi Andrew,
>>> 
>>> This is actually a bit heavy for a test case, since it has to become
>>> part of the permanent repo -- and because it also exhibits a lot of
>>> behavior that might lead to failing tests for different reasons. What
>>> would be great is something with one line of text and one small image,
>>> that the current pandoc doesn't convert properly.
>>> 
>>> The saved images would be in their original format.
>>> 
>>> Best,
>>> Jesse
>>> 
>>> Andrew Yim <andrew.yim-QDVy5qH0ceL2fBVCVOL8/A@public.gmane.org> writes:
>>> 
>>>> Hi Jesse
>>>> 
>>>> Thanks. Not sure if the attached docx file can serve as a test case. It
>>>> has 7 pages with 3 images, taken from the file in my original post. It
>>>> was saved in Word 2007 .docx format with my Word 2003. The pdf shows
>>>> what's inside the file. Would the extracted images be saved as .png
>>>> files, or in what format?
>>>> 
>>>> Thanks again.
>>>> 
>>>> On Mon, Jan 19, 2015 at 4:39 PM, Jesse Rosenthal <jrosenthal-4GNroTWusrE@public.gmane.org> wrote:
>>>> 
>>>>> Okay -- I implemented this change to read images on old versions of
>>>>> word. If Andrew (or anyone else with an older word using vml) could send
>>>>> a minimal file to use as a test case, I'll push it. (Unfortunately, I
>>>>> can't make the test file myself, since I don't have access to a version
>>>>> of word that uses vml.)



end of thread: 2015-01-21 22:45 UTC

Thread overview: 11+ messages
2015-01-19  2:29 Cannot: (i) extract images from the input docx files; (ii) convert 2 docx input files into 1 html Tl Yim
     [not found] ` <4e05f6a9-4c84-49f2-bbf3-f0b5c5479a59-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
2015-01-19  2:38   ` Matthew Pickering
     [not found]     ` <CALuQ0m9nkPNtSDLDCSfhzWsGy7Kg77L5DSpUu4eFq5AF1od8_A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-01-19 13:29       ` Andrew Yim
     [not found]         ` <CAFbbbDMWd_0vfWtqhOfFLZt3neA7P6w-zNfZ=Je6Ks0u9rjsXw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-01-19 13:45           ` Matthew Pickering
2015-01-19 15:22           ` Jesse Rosenthal
     [not found]             ` <87k30i7q2d.fsf-4GNroTWusrE@public.gmane.org>
2015-01-19 16:39               ` Jesse Rosenthal
     [not found]                 ` <87ppaad8t1.fsf-4GNroTWusrE@public.gmane.org>
2015-01-19 22:48                   ` Andrew Yim
     [not found]                     ` <CAFbbbDPWanTwpD6uXU=fqLasQkM36iAEDj7sbLN7_uih8H+1Kw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-01-19 23:56                       ` Jesse Rosenthal
     [not found]                         ` <87twzmcojp.fsf-4GNroTWusrE@public.gmane.org>
2015-01-20 12:24                           ` Andrew Yim
     [not found]                             ` <CAFbbbDMo7OOWCt2-L9euCGmWqU_LDWMYPACwvZCeaZ9JRWiUWg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-01-21 18:47                               ` Jesse Rosenthal
     [not found]                                 ` <87zj9cc6oh.fsf-4GNroTWusrE@public.gmane.org>
2015-01-21 22:45                                   ` Andrew Yim
