public inbox archive for pandoc-discuss@googlegroups.com
From: Adam Ryczkowski <adam.ryczkowski-fT3I5MopIUtcgLbgGiZUJQ@public.gmane.org>
To: pandoc-discuss <pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
Subject: Re: What is the format of the filename of images in the Pictures/ folder inside odt?
Date: Sun, 3 Dec 2023 03:13:07 -0800 (PST)
Message-ID: <734880a0-db9e-4855-b228-22902fbb387an@googlegroups.com>
In-Reply-To: <CD7D354B-58AD-485A-8B5A-2DD06510C517-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>


When Pandoc creates an ODT file from HTML containing SVG images, it embeds
the images losslessly, as-is. That's a problem for me, because my pipeline
applies some automatic transformations to the document using the LibreOffice
UNO API and ultimately saves it as DOCX. When LibreOffice saves an ODT file
containing SVG images as DOCX, it rasterizes the images at a very poor
resolution which, according to folks in the LibreOffice forum, cannot be
controlled. But there is a trick: I can use Inkscape to convert the SVG
images into EMF. EMF files are not rasterized by LibreOffice when it saves
the document as DOCX.
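
A minimal sketch of the Inkscape conversion step, assuming Inkscape 1.x is on
the PATH (older 0.9x releases used different export flags). The function names
`inkscape_emf_command` and `svg_to_emf` are made up for illustration; only the
`--export-type` and `--export-filename` options come from Inkscape itself.

```python
import subprocess
from pathlib import Path


def inkscape_emf_command(svg_path, emf_path, inkscape="inkscape"):
    """Build the Inkscape 1.x command line that exports one SVG as EMF."""
    return [
        inkscape,
        str(svg_path),
        "--export-type=emf",
        f"--export-filename={emf_path}",
    ]


def svg_to_emf(svg_path, inkscape="inkscape"):
    """Convert one SVG to an EMF file next to it and return the new path."""
    emf_path = Path(svg_path).with_suffix(".emf")
    # check=True raises CalledProcessError if Inkscape exits with an error.
    subprocess.run(inkscape_emf_command(svg_path, emf_path, inkscape), check=True)
    return emf_path
```

Calling `svg_to_emf("figure.svg")` would then be expected to leave a
`figure.emf` beside the original.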

The problem is that the EMF files obviously have different binary content
than the SVG originals. When I replace them in the "Pictures/" folder inside
the ODT, LibreOffice notices that the file names of the EMF pictures do not
match their hashes, claims the "image is corrupted", and offers to repair
it. Unfortunately, that repair dialog cannot be automated in a headless
environment, which means I need to know how to produce a "non-broken" ODT
document in the first place. For that I need to know the hashing scheme.
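
One way to test the hash assumption empirically is to compare each picture's
file name against a few common digests of its bytes. This is only a diagnostic
sketch: the candidate list (MD5, SHA-1, SHA-256) is a guess, and the helper
name `picture_name_report` is hypothetical. An empty result for a file means
none of these digests matches its name, not that no scheme exists.

```python
import hashlib
import zipfile
from pathlib import PurePosixPath

# Guessed candidates; the real scheme (if any) may not be among them.
CANDIDATE_DIGESTS = ("md5", "sha1", "sha256")


def picture_name_report(odt_file):
    """Map each Pictures/ entry to the digests whose hex form equals its stem."""
    report = {}
    with zipfile.ZipFile(odt_file) as odt:
        for name in odt.namelist():
            # Skip everything outside Pictures/ and the directory entry itself.
            if not name.startswith("Pictures/") or name.endswith("/"):
                continue
            data = odt.read(name)
            stem = PurePosixPath(name).stem.lower()
            report[name] = [
                algo
                for algo in CANDIDATE_DIGESTS
                if hashlib.new(algo, data).hexdigest() == stem
            ]
    return report
```

Running this on an ODT produced by Pandoc and on one saved by LibreOffice
should show whether the names are content hashes at all, and which tool
produces them.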

I tried to read the Pandoc sources to find the answer myself, but my zero
knowledge of Haskell is a major obstacle.

My gut feeling says the answer is somewhere in
`pandoc/src/Text/Pandoc/Writers/OpenDocument.hs`.

On Saturday, December 2, 2023 at 9:05:42 PM UTC John MacFarlane wrote:

> Why is it necessary to do this? Docx can handle svgs, can't it?
>
> > On Dec 2, 2023, at 6:16 AM, Adam Ryczkowski <adam.ry...-fT3I5MopIUtcgLbgGiZUJQ@public.gmane.org> wrote:
> >
> > Hi!
> >
> > I am writing a script that replaces "svg" images with "emf" in the odt in
> > order to allow lossless conversion to "docx" format using LibreOffice.
> >
> > The problem is that merely replacing the files and fixing
> > `content.xml` does not suffice. The image file name is some form of hash of
> > its contents. If the contents do not match, LibreOffice reports the
> > document as "broken" (but offers to repair it). Alas, this repair cannot be
> > automated.
> >
> > I tried to get that from the Pandoc source code, but Haskell's syntax
> > seems too alien to me.
> > 
> > What is the naming convention for the files in the Pictures/ folder?
> > 
> > -- 
> > You received this message because you are subscribed to the Google 
> Groups "pandoc-discuss" group.
> > To unsubscribe from this group and stop receiving emails from it, send 
> an email to pandoc-discus...-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
> > To view this discussion on the web visit 
> https://groups.google.com/d/msgid/pandoc-discuss/88f8c8dc-b9b6-4e6e-91ce-75e08412e466n%40googlegroups.com
> .
>
>
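
The replacement step described in the quoted message (swapping the picture and
fixing the references) might be sketched as below. The helper `swap_picture`
is hypothetical. It rewrites the path in `content.xml`, `styles.xml` and
`META-INF/manifest.xml` by plain byte substitution and keeps `mimetype` as the
first, uncompressed entry, as the ODF package format requires. It does not
touch the `manifest:media-type` attribute of the picture entry, and whether
the result satisfies LibreOffice is exactly the open question of this thread.

```python
import zipfile

# Package members that may reference a picture path.
REFERENCING_FILES = ("content.xml", "styles.xml", "META-INF/manifest.xml")


def swap_picture(src_odt, dst_odt, old_name, new_name, new_data):
    """Copy an ODT, replacing one picture and every reference to its path."""
    with zipfile.ZipFile(src_odt) as src, zipfile.ZipFile(dst_odt, "w") as dst:
        # Iterating in archive order keeps mimetype first if it was first.
        for info in src.infolist():
            if info.filename == old_name:
                # Write the new picture in place of the old one.
                dst.writestr(new_name, new_data, compress_type=zipfile.ZIP_DEFLATED)
                continue
            data = src.read(info.filename)
            if info.filename in REFERENCING_FILES:
                data = data.replace(old_name.encode(), new_name.encode())
            # The mimetype entry must stay uncompressed.
            compress = (
                zipfile.ZIP_STORED
                if info.filename == "mimetype"
                else zipfile.ZIP_DEFLATED
            )
            dst.writestr(info.filename, data, compress_type=compress)
```

A call such as `swap_picture("in.odt", "out.odt", "Pictures/0.svg",
"Pictures/0.emf", emf_bytes)` would use whatever names the source ODT
actually contains; the ones shown here are placeholders.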



Thread overview: 4+ messages
2023-12-02 14:16 Adam Ryczkowski
     [not found] ` <88f8c8dc-b9b6-4e6e-91ce-75e08412e466n-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
2023-12-02 21:05   ` John MacFarlane
     [not found]     ` <CD7D354B-58AD-485A-8B5A-2DD06510C517-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2023-12-03 11:13       ` Adam Ryczkowski [this message]
     [not found]         ` <734880a0-db9e-4855-b228-22902fbb387an-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
2023-12-03 18:29           ` John MacFarlane
