public inbox archive for pandoc-discuss@googlegroups.com
* HTML to epub formulas including missing images
@ 2020-12-30 20:52 JabariG
From: JabariG @ 2020-12-30 20:52 UTC (permalink / raw)
  To: pandoc-discuss



Hi,

I was able to convert a web page (downloaded using the "Webpage, complete" 
option in Chrome) to ePub with Pandoc beautifully. But there's one issue so 
far. All of the formulas (rendered in HTML with MathML) now have the 
"broken image link" symbol next to each of them (see below). 




Other than this issue the formulas look fine. I'd like to convert more of 
these pages and avoid going through each epub file and deleting the missing 
images, since it doesn't look like I need them.

Is there an argument I can add to the CLI before converting my file to 
remove these images?

Best,
-Jabari

-- 
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/9b6cfa58-4049-4660-9d7e-b3c55cd732c2n%40googlegroups.com.


* Re: HTML to epub formulas including missing images
@ 2020-12-31 16:57   ` John MacFarlane
From: John MacFarlane @ 2020-12-31 16:57 UTC (permalink / raw)
  To: JabariG, pandoc-discuss

JabariG <aboutjabari-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> writes:

> Hi,
>
> I was able to convert a web page (downloaded using the "Webpage, complete" 
> option in Chrome) to ePub with Pandoc beautifully. But there's one issue so 
> far. All of the formulas (rendered in HTML with MathML) now have the 
> "broken image link" symbol next to each of them (see below). 

Perhaps the web page contained MathML with a fallback image?
Hard to say without seeing the source.



* Re: HTML to epub formulas including missing images
@ 2021-01-01 20:56         ` John MacFarlane
From: John MacFarlane @ 2021-01-01 20:56 UTC (permalink / raw)
  To: Jabari Garland; +Cc: pandoc-discuss


Looks like the chapter xhtml contains

<img xmlns="http://www.w3.org/1999/xhtml" src="../media/file11.svg" />

and that is the image that isn't displaying.  I'm not sure why it
isn't, but when I try to open it in Safari, it doesn't display an
image (as it usually does with SVGs).  Anyway, maybe that's not
relevant; you just want to get rid of the image.

If you do pandoc Chapter\ 18.html -t native, you'll see how
pandoc parses this content.  You should see things like

 ,Div ("",["MathJax_SVG_Display"],[("style","text-align: center;")])
     [Plain [Span ("MathJax-Element-2-Frame",["MathJax_SVG"],[("tabindex","0"),("mathml","<math xmlns=\"http://www.w3.org/1998/Math/MathML\" altimg-valign=\"-18.5\" display=\"block\"><mrow><mi>U</mi><mo>=</mo><mfrac><mrow><mn>1</mn></mrow><mrow><mstyle displaystyle=\"true\"><mrow><mo>&#x2211;</mo><mi>R</mi></mrow></mstyle></mrow></mfrac></mrow></math>"),("role","presentation"),("style","font-size: 100%; display: inline-block; position: relative;")]) [Image ("",[],[]) [] ("data:image/svg+xml;base64,PHN2ZyB4bWxuczp4bGluaz0iaHR0cDovL3d3dy53My5vcmcvMTk5OS94bGluayIgd2lkdGg9IjExLjIyM2V4IiBoZWlnaHQ9IjYuODcxZXgiIHZpZXdib3g9IjAgLTE0NDQuOCA0ODMyLjIgMjk1OC4zIiByb2xlPSJpbWciIGZvY3VzYWJsZT0iZmFsc2UiIHN0eWxlPSJ2ZXJ0aWNhbC1hbGlnbjogLTMuNTE1ZXg7IiBhcmlhLWhpZGRlbj0idHJ1ZSI+PGcgc3Ryb2tlPSJjdXJyZW50Q29sb3IiIGZpbGw9ImN1cnJlbnRDb2xvciIgc3Ryb2tlLXdpZHRoPSIwIiB0cmFuc2Zvcm09Im1hdHJpeCgxIDAgMCAtMSAwIDApIj48dXNlIGhyZWY9IiNNSk1BVEhJLTU1IiB4PSIwIiB5PSIwIj48L3VzZT48dXNlIGhyZWY9IiNNSk1BSU4tM0QiIHg9IjEwNDUiIHk9IjAiPjwvdXNlPjxnIHRyYW5zZm9ybT0idHJhbnNsYXRlKDE4MjMsMCkiPjxnIHRyYW5zZm9ybT0idHJhbnNsYXRlKDM5NywwKSI+PHJlY3Qgc3Ryb2tlPSJub25lIiB3aWR0aD0iMjQ5MCIgaGVpZ2h0PSI2MCIgeD0iMCIgeT0iMjIwIj48L3JlY3Q+PHVzZSBocmVmPSIjTUpNQUlOLTMxIiB4PSI5OTUiIHk9IjY3NiI+PC91c2U+PGcgdHJhbnNmb3JtPSJ0cmFuc2xhdGUoNjAsLTk2MykiPjx1c2UgaHJlZj0iI01KU1oyLTIyMTEiIHg9IjAiIHk9IjAiPjwvdXNlPjx1c2UgaHJlZj0iI01KTUFUSEktNTIiIHg9IjE2MTEiIHk9IjAiPjwvdXNlPjwvZz48L2c+PC9nPjwvZz48L3N2Zz4=",""),Span ("",["MJX_Assistive_MathML","MJX_Assistive_MathML_Block"],[("role","presentation")]) [Math DisplayMath "U = \\frac{1}{\\sum R}"]]]]

Here you have a Span with class "MathJax_SVG".  It contains two
elements:

- an svg image with a data uri
- a Span with class "MJX_Assistive_MathML" containing a Math
  element. 

All pandoc needs is the Math element, so we want to get rid of
the image.
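
To see what the Image element actually holds, the start of its data URI
can be decoded on the command line. This is only a sketch: it assumes a
base64 tool that accepts --decode (GNU coreutils and macOS both do), and
it decodes just the first 64 characters of the string shown above rather
than the whole payload.

```shell
# First 64 characters of the base64 payload from the data URI above
# (a multiple of 4, so it decodes cleanly on its own).
prefix='PHN2ZyB4bWxuczp4bGluaz0iaHR0cDovL3d3dy53My5vcmcvMTk5OS94bGluayIg'

# Decode the prefix and print the resulting markup.
decoded=$(printf '%s' "$prefix" | base64 --decode)
printf '%s\n' "$decoded"
```

The output starts with an <svg xmlns:xlink=...> tag, so the image is the
SVG rendering that MathJax produced for the same formula the Math element
already carries. Decoding the full string the same way shows
<use href="#MJMATHI-55" ...> references to glyph definitions that are not
part of the image itself, which may be why it renders as empty once
extracted; that is an inference, not something confirmed in this thread.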

What you need, then, is to write a Lua filter that removes Image
elements from inside Span elements with class "MathJax_SVG".

Here's the documentation on Lua filters:
https://pandoc.org/lua-filters.html

If you get stuck, I'm sure someone on the list can help out.

My guess is that something like this will work, but it's untested
and probably has mistakes:

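-- Drop the SVG image from MathJax spans; keep the Math element.
function Span(el)
  -- el.classes is a list, so test membership; == on two tables
  -- compares identity and would never be true here
  if el.classes:includes("MathJax_SVG") then
    return pandoc.walk_inline(el, {Image = function(im) return {} end })
  end
end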
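
One way to try a filter like this is to save it to a file and pass it to
pandoc with --lua-filter. The sketch below makes several assumptions: the
file name strip-mathjax-svg.lua is arbitrary, input.html and output.epub
are the names from the command quoted further down, and the filter tests
class membership with el.classes:includes, because comparing two Lua
tables with == checks identity rather than contents. It has not been run
against the original page.

```shell
# Write the filter to a file (the name is an arbitrary choice).
cat > strip-mathjax-svg.lua <<'EOF'
-- Remove Image elements found inside Spans with class "MathJax_SVG".
function Span(el)
  if el.classes:includes("MathJax_SVG") then
    return pandoc.walk_inline(el, {Image = function() return {} end})
  end
end
EOF

# Re-run the original conversion with the filter applied.
# The guard skips the step when pandoc or the input file is missing.
if command -v pandoc >/dev/null 2>&1 && [ -f input.html ]; then
  pandoc -f html -t epub3 --lua-filter=strip-mathjax-svg.lua \
    -o output.epub input.html
fi
```

Running pandoc with -t native and the same --lua-filter option instead of
-t epub3 is a quick way to check that the Image elements are gone before
building the epub.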



Jabari Garland <aboutjabari-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> writes:

> Hi John,
>
> Please see the attached zip of the HTML page and its files (I stripped out
> most of the scripts, but the page looks fine without them). I've also
> included my resulting epub file (and a PDF of the HTML, just in case
> something gets lost in translation).
>
> Below are the arguments I used in the CLI when converting to epub from HTML:
>
> pandoc -f html -t epub3 -o output.epub input.html
>
> (I found this on Stack Overflow, "Convert HTML files to epub files
> programmatically ( command line ubuntu )":
> <https://stackoverflow.com/questions/21626219/convert-html-files-to-epub-files-programmatically-command-line-ubuntu>.
> I'm using Windows 10, but it seems to have worked fine even though this
> suggestion was meant for a command line in Ubuntu.)
>
> Any help will be much appreciated.
