public inbox archive for pandoc-discuss@googlegroups.com
From: Kolen Cheung <christian.kolen-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
To: pandoc-discuss <pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
Subject: Re: HTML → EPUB: Either "Out of memory" or "openBinaryFile: invalid argument (Invalid argument)"
Date: Tue, 28 Apr 2020 17:44:50 -0700 (PDT)
Message-ID: <e6b7a39d-47da-482e-ac03-13e593f3c630@googlegroups.com>
In-Reply-To: <774af370-df13-43ec-97bc-68af09d2c2f4-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>


I think you could still ask people here. Pandoc Discuss is fairly casual, and 
as long as you make it clear that the question is not strictly about pandoc, 
I think people are OK with it. There are some experts on format conversion 
here (for obvious reasons).

Or, if you still want to use pandoc as part of the "from PDF" workflow, then 
it is very relevant here. Essentially, pandoc can take many formats as input, 
and many other programs can read PDF and write to a format that pandoc 
understands. The question then is which intermediate format and software are 
best to use.

E.g. I have tried converting PDF to docx with Acrobat/Word and then passing the docx to pandoc.
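
As a rough sketch of the pandoc half of that route, assuming the docx has 
already been exported from Acrobat or Word (the function name and the file 
names are placeholders, not taken from this thread):

```shell
# Sketch only: docx_to_epub is a hypothetical helper name.
# The PDF -> docx export happens beforehand in Acrobat or Word;
# this covers only the docx -> EPUB step done by pandoc.
docx_to_epub() {
  # $1 = input .docx file, $2 = output .epub file
  pandoc -f docx -t epub3 -o "$2" "$1"
}

# Example call (assumes input.docx exists in the current directory):
# docx_to_epub input.docx output.epub
```

Replacing epub3 with epub in the function gives the other EPUB writer that 
was tried earlier in this thread.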

For images in PDF format, I converted them to SVG with pdf2svg and included 
them as inline images (including a PDF directly as an image is also OK for 
some output formats).
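
A minimal sketch of that conversion step, assuming pdf2svg is installed and 
that the figure of interest sits on a single page (the function name and file 
names are placeholders):

```shell
# Sketch only: pdf_page_to_svg is a hypothetical helper name.
# pdf2svg takes the input PDF, the output SVG, and optionally a page number.
pdf_page_to_svg() {
  # $1 = input .pdf, $2 = output .svg, $3 = page number (defaults to 1)
  pdf2svg "$1" "$2" "${3:-1}"
}

# Example call (assumes figure.pdf exists in the current directory):
# pdf_page_to_svg figure.pdf figure.svg 1
```

The resulting SVG can then be referenced like any other image in the 
document that is passed to pandoc.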

On Thursday, April 23, 2020 at 7:53:02 AM UTC-7, Heck Lennon wrote:
>
> Thanks for the tip.
>
> 1. Removed the Ubuntu 2.5-2 package through apt-get remove
> 2. Downloaded and installed pandoc-2.9.2.1-1-amd64.deb
> 3. Ran: pandoc -f html -t epub3 -o output.epub3.epub input.html AND 
> pandoc -f html -t epub -o output.epub2.epub input.html
> 4. Opened each file in Windows with SumatraPDF (which supports EPUB): 
> both opened OK.
>
> The remaining issue is how to better convert PDF to HTML, but this has 
> nothing to do with pandoc.
>
> Thank you all!
>
> On Thursday, April 23, 2020 at 00:17:54 UTC+2, Kolen Cheung wrote:
>>
>> Your version is too old. Try to reproduce it using the latest version: 
>> https://github.com/jgm/pandoc/releases/latest There are various ways to 
>> install it, e.g. you can just unpack pandoc-2.9.2.1-linux-amd64.tar.gz and 
>> put pandoc and pandoc-citeproc somewhere in your PATH, such as 
>> ~/.local/bin
>>
>> (To go one step further, you can download the latest nightly build from 
>> GitHub Actions to make sure the problem has not been solved yet.)
>>
>> In general, you want to make sure the problem has not already been solved, 
>> and for that you need the latest version. Unfortunately, this can be a big 
>> problem on distros with a package manager, because people often just use 
>> the packaged version, which tends to be too old, especially on Ubuntu.
>>
>> On Wednesday, April 22, 2020 at 2:59:38 PM UTC-7, Heck Lennon wrote:
>>>
>>> pandoc 2.5.2 on Ubuntu 19.10.
>>>
>>> Turns out I had to use "-t epub" instead of "-t epub3":
>>>
>>> pandoc -f html -t epub -o output.epub input.html
>>>
>>> Thank you.
>>>
>>> On Wednesday, April 22, 2020 at 17:58:39 UTC+2, John MacFarlane wrote:
>>>>
>>>>
>>>> What pandoc version are you running on the linux box? 
>>>> This works fine for me. 
>>>>
>>>>
>>>> Heck Lennon <frdt...-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> writes: 
>>>>
>>>> > Since I had a Linux host available, I went around that issue with 
>>>> > Windows and shell expansion. 
>>>> > 
>>>> > pandoc -f html -t epub3 -o output.epub input.html 
>>>> > 
>>>> > 
>>>> > pandoc ran successfully (no error message), but the EPUB can't be 
>>>> > opened in a Windows GUI application that supports EPUB files ("Error 
>>>> > loading file.epub"). Likewise, I can't open the file after changing 
>>>> > its extension from EPUB to ZIP. 
>>>> > 
>>>> > Here are the input files (HTML + PNGs): 
>>>> > 
>>>> > https://we.tl/t-5EeGXML1rb 
>>>> > 
>>>> > Do I need extra options in the command line? 
>>>> > 
>>>> > On Wednesday, April 22, 2020 at 11:55:49 UTC+2, Heck Lennon wrote: 
>>>> >> 
>>>> >> Thanks, everyone, for the info! 
>>>> >> 
>>>> >> On Wednesday, April 22, 2020 at 01:25:21 UTC+2, Kolen Cheung wrote: 
>>>> >>> 
>>>> >>> A side note: since your goal is to convert from PDF to ePub, you 
>>>> >>> will probably get better results using other tools. E.g. I know a 
>>>> >>> PDF can be converted to docx, and then from docx to ePub. There may 
>>>> >>> be a tool that can convert it directly too. Essentially, you'd want 
>>>> >>> to choose the tools that preserve the most information. And since 
>>>> >>> pandoc focuses mainly on the structure of the document, much other 
>>>> >>> information would be lost. The choice of tool also depends on which 
>>>> >>> ones you're comfortable with. E.g. the PDF to docx conversion I 
>>>> >>> mentioned can probably be done by Adobe Acrobat and MS Word, but 
>>>> >>> they are proprietary and difficult to run from the command line. 
>>>> >>> 
>>>> >>> In your case, since you already have a tool that preconverted them 
>>>> >>> to HTML, HTML to ePub may be done better by some other engines 
>>>> >>> (since the two are closely related). Maybe you can try Calibre, 
>>>> >>> which also has a CLI. 
>>>> >> 
>>>> >> 
>>>> > 
>>>> > -- 
>>>> > You received this message because you are subscribed to the Google 
>>>> > Groups "pandoc-discuss" group. 
>>>> > To unsubscribe from this group and stop receiving emails from it, 
>>>> > send an email to pandoc-...-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org 
>>>> > To view this discussion on the web visit 
>>>> > https://groups.google.com/d/msgid/pandoc-discuss/b3218bbb-9846-4e52-b201-7e4a1b8b09d6%40googlegroups.com. 
>>>>
>>>>
>>>


