public inbox archive for pandoc-discuss@googlegroups.com
From: John MacFarlane <jgm-TVLZxgkOlNX2fBVCVOL8/A@public.gmane.org>
To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org
Subject: New custom reader for extracting content from web pages
Date: Sun, 16 Jan 2022 10:58:57 -0800
Message-ID: <m2o84b1p8u.fsf@MacBook-Pro-2.hsd1.ca.comcast.net>


I've added a new example of a custom reader, which runs
the 'readability-cli' program on HTML input before processing
it with pandoc, extracting the content and omitting navigation
and layout.

See
https://pandoc.org/custom-readers.html#example-extracting-the-content-from-web-pages

This shows how the new custom reader interface, when combined
with pandoc.read in the Lua API, can be used to add
preprocessors.
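
As a rough illustration of the preprocessor pattern, here is a minimal
sketch of a custom reader. It is not the example from the linked page:
strip_nav is a hypothetical stand-in for an external tool such as
readability-cli, and it only deletes <nav> blocks with a crude Lua
pattern. The cleaned text is then handed to pandoc.read.

```lua
-- Hypothetical preprocessor: crudely delete <nav>...</nav> blocks.
-- A real reader would more likely pipe the input through an external
-- program; this stand-in just keeps the sketch self-contained.
function strip_nav(html)
  -- The outer parentheses discard gsub's second return value
  -- (the number of substitutions made).
  return (html:gsub("<nav[^>]*>.-</nav>", ""))
end

-- pandoc calls Reader with the input sources and the reader options.
function Reader(input, reader_options)
  -- tostring(input) concatenates the text of all sources.
  local cleaned = strip_nav(tostring(input))
  -- Parse the preprocessed text with pandoc's built-in HTML reader.
  return pandoc.read(cleaned, "html", reader_options)
end
```

Assuming the file is saved as strip-nav.lua (an illustrative name), it
could be used with something like
"pandoc -f strip-nav.lua -t markdown URL".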

(Of course, you could do something similar in a shell script.
But doing it this way ensures that pandoc will be able to
retrieve resources such as images from the URL.  In addition,
the reader does some further processing to remove structural
Divs that clutter the output, and it is easily customizable.)
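
The Div cleanup could look roughly like the sketch below. The names
is_bare_div and unwrap_layout_divs are hypothetical, and the test for
a "structural" Div (no identifier and no classes) is an assumption
made for illustration; the example on the linked page may use
different criteria.

```lua
-- Assumption: a Div with no identifier and no classes is treated as
-- purely structural. Key-value attributes are ignored here for brevity.
function is_bare_div(div)
  return div.identifier == "" and #div.classes == 0
end

-- Walk the parsed document and replace each bare Div by its contents.
function unwrap_layout_divs(doc)
  return doc:walk {
    Div = function(div)
      if is_bare_div(div) then
        -- Returning a list of blocks splices them in place of the Div.
        return div.content
      end
      -- Returning nothing leaves the Div unchanged.
    end
  }
end
```

Inside a custom reader, this would be applied to the result of
pandoc.read before returning it.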


