From: Michael Love <mykelove-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
To: pandoc-discuss <pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
Subject: Re: New custom reader for extracting content from web pages
Date: Sat, 22 Jan 2022 14:02:05 -0800 (PST)
Message-ID: <5d9fa569-19d0-490b-88cf-1fb5fe73a400n@googlegroups.com>
In-Reply-To: <m2o84b1p8u.fsf-jF64zX8BO0+FqBokazbCQ6OPv3vYUT2dxr7GGTnW70NeoWH0uzbU5w@public.gmane.org>



Tried the suggestion from your private message (please excuse my shyness).
Found out that this is well beyond my abilities. Got Node installed.
Installed readability-cli. Discovered that I had to have an init.lua
file. Found one and saved it in the directory holding pandoc. Copied your
script and saved it as readable.lua in the same directory. Discovered
that I don't seem to know where init.lua and readable.lua need to go. So
far, when I run "pandoc -f readable.lua [et cetera]", the response is:
"error running Lua: [new line] cannot open readable.lua: No such file or
directory." All this was on macOS. Tried the same on Microsoft Windows,
with a similar result.

My weariness convinces me that I am Well Out of My Depth. I don't seem to
know enough about Node, Lua, or whatever pandoc is written in. Learned
some things about Node and Lua. Success is over a horizon too far.
Someday.

Thanks anyway.
Pandoc is still wonderful.
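
A note on the "cannot open readable.lua" error: a likely cause, assuming
pandoc resolves a relative script name against the shell's current working
directory (not against the directory that holds the pandoc executable), is
that the command was run from some other directory. The sketch below checks
a list of candidate directories for the script and passes the path it finds
to pandoc. The function names find_reader and run_readable, the directory
$HOME/pandoc-readers, and the "-t markdown" output choice are illustrative
assumptions, not part of the original example.

```shell
#!/bin/sh
# find_reader NAME DIR...
# Print the path of the first DIR/NAME that exists as a regular file.
# Returns non-zero if none of the directories contains NAME.
find_reader() {
  name=$1
  shift
  for dir in "$@"; do
    if [ -f "$dir/$name" ]; then
      printf '%s\n' "$dir/$name"
      return 0
    fi
  done
  return 1
}

# run_readable URL
# Look for readable.lua in the current directory, then in a hypothetical
# personal scripts directory, and hand pandoc the path that was found.
run_readable() {
  if reader=$(find_reader readable.lua "$PWD" "$HOME/pandoc-readers"); then
    pandoc -f "$reader" -t markdown "$1"
  else
    echo "readable.lua not found; cd to its directory or pass its full path" >&2
    return 1
  fi
}
```

Usage would be "run_readable https://example.com/some-article" (a
placeholder URL). The manual equivalent is the same idea: cd into the
directory containing readable.lua before running pandoc, or give the
script's full path after -f.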
 

On Sunday, January 16, 2022 at 1:59:14 PM UTC-5 John MacFarlane wrote:

>
> I've added a new example of a custom reader, which runs
> the 'readability-cli' program on HTML input before processing
> it with pandoc, extracting the content and omitting navigation
> and layout.
>
> See
>
> https://pandoc.org/custom-readers.html#example-extracting-the-content-from-web-pages
>
> This shows how the new custom reader interface, when combined
> with pandoc.read in the Lua API, can be used to add
> preprocessors.
>
> (Of course, you could do something similar in a shell script.
> But doing it this way ensures that pandoc will be able to
> retrieve resources (e.g. images) from the URL. In addition,
> the filter does some further processing to remove structural
> Divs that clutter the output, and it is easily customizable.)
>

