public inbox archive for pandoc-discuss@googlegroups.com
* New custom reader for extracting content from web pages
@ 2022-01-16 18:58 John MacFarlane
From: John MacFarlane @ 2022-01-16 18:58 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw


I've added a new example of a custom reader, which runs
the 'readability-cli' program on HTML input before processing
it with pandoc, extracting the content and omitting navigation
and layout.

See
https://pandoc.org/custom-readers.html#example-extracting-the-content-from-web-pages

This shows how the new custom reader interface, when combined
with pandoc.read in the Lua API, can be used to add
preprocessors.
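
A minimal sketch of that pattern follows. It is not the example from
the manual (see the link above for that). It assumes that 'readable'
(from readability-cli) is on the PATH and reads HTML from stdin, that
its --base option sets the URL used to resolve relative links, and that
your pandoc version passes a list of sources to Reader. The file name
readable.lua and the helper extract_content are hypothetical.

```lua
-- readable.lua (hypothetical file name): a custom reader that
-- preprocesses HTML with an external program before parsing it.

-- Run one input through 'readable' and return the extracted HTML.
-- Assumption: 'readable' reads from stdin when no source is given.
local function extract_content(source)
  return pandoc.pipe("readable", {"--base", source.name}, source.text)
end

-- pandoc calls Reader with the list of inputs and the reader options.
function Reader(sources, opts)
  local html = ""
  for _, source in ipairs(sources) do
    html = html .. extract_content(source)
  end
  -- Hand the preprocessed text to pandoc's built-in HTML reader.
  return pandoc.read(html, "html", opts)
end
```

Under those assumptions it would be invoked as something like
`pandoc -f readable.lua URL -t markdown`. A local file name is not a
URL, so a fuller version would need to adjust source.name in that case.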

(Of course, you could do something similar in a shell script.
But doing it this way ensures that pandoc will be able to
retrieve resources (e.g. images) from the URL.  In addition,
the reader does some further processing to remove structural
Divs that clutter the output, and it is easily customizable.)
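
To give an idea of what removing structural Divs can look like, here
is a rough sketch, not the code from the manual. The list of layout
classes and the names only_layout, unwrap_layout_div and
strip_layout_divs are illustrative assumptions, and the walk method
requires a reasonably recent pandoc.

```lua
-- Classes treated as layout-only (an illustrative guess, easy to change).
local layout_classes = { row = true, page = true, container = true }

-- True if every class in the list is a layout class (or there are none).
local function only_layout(classes)
  for _, class in ipairs(classes) do
    if not layout_classes[class] then
      return false
    end
  end
  return true
end

-- Replace a Div that has no identifier and only layout classes by its
-- contents; keep every other Div unchanged. Other attributes are ignored.
local function unwrap_layout_div(div)
  if div.identifier == "" and only_layout(div.classes) then
    return div.content
  end
  return div
end

-- Apply the rule to every Div in a parsed document.
function strip_layout_divs(doc)
  return doc:walk({ Div = unwrap_layout_div })
end
```

A custom reader could call strip_layout_divs on the document returned
by pandoc.read before returning it. Customizing the behaviour is then
a matter of editing the layout_classes table.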

