public inbox archive for pandoc-discuss@googlegroups.com
* What is the format of the filename of images in the Pictures/ folder inside odt?
@ 2023-12-02 14:16 Adam Ryczkowski
       [not found] ` <88f8c8dc-b9b6-4e6e-91ce-75e08412e466n-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  0 siblings, 1 reply; 4+ messages in thread
From: Adam Ryczkowski @ 2023-12-02 14:16 UTC (permalink / raw)
  To: pandoc-discuss


Hi!

I am writing a script that replaces "svg" images with "emf" in the odt in
order to allow lossless conversion to "docx" format using LibreOffice.

The problem is that merely replacing the files and fixing `content.xml`
does not suffice. The image file name is some form of hash of its contents.
If the contents do not match, LibreOffice reports the document to be
"broken" (but offers to repair it). Alas, this repair cannot be automated.

I tried to work that out from the Pandoc source code, but Haskell's syntax
seems too alien to me.

What is the naming convention for the files in the Pictures/ folder?

-- 
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/88f8c8dc-b9b6-4e6e-91ce-75e08412e466n%40googlegroups.com.


* Re: What is the format of the filename of images in the Pictures/ folder inside odt?
       [not found] ` <88f8c8dc-b9b6-4e6e-91ce-75e08412e466n-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2023-12-02 21:05   ` John MacFarlane
       [not found]     ` <CD7D354B-58AD-485A-8B5A-2DD06510C517-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 4+ messages in thread
From: John MacFarlane @ 2023-12-02 21:05 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

Why is it necessary to do this?  Docx can handle svgs, can't it?


* Re: What is the format of the filename of images in the Pictures/ folder inside odt?
       [not found]     ` <CD7D354B-58AD-485A-8B5A-2DD06510C517-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2023-12-03 11:13       ` Adam Ryczkowski
       [not found]         ` <734880a0-db9e-4855-b228-22902fbb387an-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  0 siblings, 1 reply; 4+ messages in thread
From: Adam Ryczkowski @ 2023-12-03 11:13 UTC (permalink / raw)
  To: pandoc-discuss


When Pandoc creates an ODT file from HTML containing SVG images, it
losslessly embeds the images as-is. That's a problem for me, because my
pipeline applies some automatic transformations to the document using the
LibreOffice UNO API and ultimately saves it as DOCX. When LibreOffice saves
an ODT file containing SVG images as DOCX, it rasterizes the images at a
very poor resolution which, according to folks on the LibreOffice forum,
cannot be controlled. But there is a trick: I can use Inkscape to convert
the SVG images into EMF. EMF files are not rasterized by LibreOffice when it
saves the document as DOCX.

The problem is that the EMF files obviously have different binary content
than the SVG originals. When I replace them in the "Pictures/" folder inside
the ODT, LibreOffice notices that the file names of the EMF pictures do
not match their hashes, claims the "image is corrupted", and offers an
option to repair. Unfortunately, that repair dialog cannot be automated in
a headless environment, which means I need to know how to produce a
"non-broken" ODT document in the first place. For that I need to know the
hashing scheme.

I tried to read the Pandoc sources to find the answer myself, but my zero
knowledge of Haskell is a major obstacle.

My gut feeling says the answer is somewhere in
`pandoc/src/Text/Pandoc/Writers/OpenDocument.hs`.
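
Editorial sketch: one way to test the "file name is a hash of the contents" hypothesis without reading any Haskell is to open an ODT that Pandoc produced and compare each picture's file stem against a few common digests of its bytes. The function name `guess_naming_scheme` and the candidate list (MD5, SHA-1, SHA-256) are assumptions made for illustration; nothing in the thread confirms which scheme, if any, applies.

```python
import hashlib
import io
import posixpath
import zipfile

# Candidate digests to try. The real scheme is unknown here, so this list is a guess.
CANDIDATES = ("md5", "sha1", "sha256")


def guess_naming_scheme(odt_bytes):
    """Map each Pictures/ entry to the digest whose hex form equals its stem.

    Entries whose stem matches none of the candidates map to None.
    """
    results = {}
    with zipfile.ZipFile(io.BytesIO(odt_bytes)) as archive:
        for name in archive.namelist():
            # Only look at files stored under Pictures/ (skip directory entries).
            if not name.startswith("Pictures/") or name.endswith("/"):
                continue
            stem = posixpath.splitext(posixpath.basename(name))[0].lower()
            data = archive.read(name)
            results[name] = next(
                (algo for algo in CANDIDATES
                 if hashlib.new(algo, data).hexdigest() == stem),
                None,
            )
    return results
```

To use it, read the bytes of an unmodified Pandoc-generated ODT (for example `open("document.odt", "rb").read()`, where `document.odt` is a placeholder path) and inspect the returned mapping. If every picture maps to the same digest name, that digest is a plausible naming scheme; if all values are `None`, the names are probably not a plain hash of the contents.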


* Re: What is the format of the filename of images in the Pictures/ folder inside odt?
       [not found]         ` <734880a0-db9e-4855-b228-22902fbb387an-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2023-12-03 18:29           ` John MacFarlane
  0 siblings, 0 replies; 4+ messages in thread
From: John MacFarlane @ 2023-12-03 18:29 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

Look at `insertMedia` in Text.Pandoc.MediaBag.
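
Editorial sketch: the thread does not spell out what `insertMedia` does. The snippet below assumes, without confirmation from the thread, that a hashed name is the lowercase hex SHA-1 digest of the file's bytes followed by the file extension, and that the file lives under `Pictures/`. The helper name `hashed_picture_name` is hypothetical. Check the result against a file name from a real Pandoc-generated ODT before relying on it.

```python
import hashlib
import posixpath


def hashed_picture_name(contents: bytes, extension: str) -> str:
    """Build a candidate archive path for a replacement picture.

    ASSUMPTION: name = hex SHA-1 of the bytes + "." + extension, stored
    under Pictures/. This is a guess, not a confirmed Pandoc rule.
    """
    digest = hashlib.sha1(contents).hexdigest()
    return posixpath.join("Pictures", f"{digest}.{extension.lstrip('.')}")
```

If the assumption holds, each converted EMF file would be stored under `hashed_picture_name(emf_bytes, "emf")`, and every reference to the old SVG name in `content.xml` would need to point at the new name.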


