New custom reader for extracting content from web pages

public inbox archive for pandoc-discuss@googlegroups.com

* New custom reader for extracting content from web pages
From: John MacFarlane @ 2022-01-16 18:58 UTC (permalink / raw)
  To: pandoc-discuss

I've added a new example of a custom reader, which runs
the 'readability-cli' program on HTML input before processing
it with pandoc, extracting the content and omitting navigation
and layout.

See
https://pandoc.org/custom-readers.html#example-extracting-the-content-from-web-pages

This shows how the new custom reader interface, when combined
with pandoc.read in the Lua API, can be used to add
preprocessors.

(Of course, you could do something similar in a shell script.
But doing it this way ensures that pandoc will be able to
retrieve resources (e.g. images) from the URL.  In addition,
the filter does some further processing to remove structural
Divs that clutter the output, and it is easily customizable.)
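
For comparison, the shell-script route mentioned in the parenthesis can be sketched in Python roughly as follows. This is only an illustration, not the reader from the linked page: the function names are hypothetical, and the `readable` flags (`--keep-classes`, `--base`) are assumptions that should be checked against `readable --help`. The external commands are injected through the `pipe` parameter so the control flow can be exercised without either program installed.

```python
import subprocess
from typing import Callable, List

# A "pipe" takes an argv list and input text and returns the program's stdout.
Pipe = Callable[[List[str], str], str]


def readable_command(base_url: str) -> List[str]:
    """Argv for readability-cli, reading HTML on stdin (flags are assumptions)."""
    return ["readable", "--keep-classes", "--base", base_url]


def pandoc_command(output_format: str = "markdown") -> List[str]:
    """Argv for pandoc, converting HTML on stdin to the requested format."""
    return ["pandoc", "--from", "html", "--to", output_format]


def run_pipe(argv: List[str], text: str) -> str:
    """Run a program, feed it `text` on stdin, and return its stdout."""
    done = subprocess.run(argv, input=text, capture_output=True,
                          text=True, check=True)
    return done.stdout


def extract_content(html: str, base_url: str,
                    output_format: str = "markdown",
                    pipe: Pipe = run_pipe) -> str:
    """Preprocess HTML with `readable`, then hand the result to pandoc."""
    cleaned = pipe(readable_command(base_url), html)
    return pipe(pandoc_command(output_format), cleaned)
```

Calling `extract_content(html, "https://example.com/page")` with the default `pipe` runs both programs for real. Because pandoc only sees text on stdin here, it does not know the source URL, which is the limitation the custom reader avoids.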


* Re: New custom reader for extracting content from web pages
From: Michael Love @ 2022-01-22 22:02 UTC (permalink / raw)
  To: pandoc-discuss


Tried the suggestion from your private message (please excuse my
shyness). Found out that this is well beyond my abilities. Got Node
installed. Installed readability-cli. Discovered that I had to have an
init.lua file. Found one and saved it in the directory holding pandoc.
Copied your script and saved it as readable.lua in the same directory.
Discovered that I don't seem to know where init.lua and readable.lua
need to go. So far, when I run "pandoc -f readable.lua [et cetera]",
the response is: "error running Lua: [new line] cannot open
readable.lua: No such file or directory." All this was on macOS. Tried
the same on Windows; similar result.

My weariness convinces me that I am Well Out of My Depth. Don't seem to
know enough about Node, Lua, or whatever pandoc is written in. Learned
some things about Node and Lua. Success is over a horizon too far.
Someday.

Thanks anyway.
Pandoc is still wonderful.
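
One possible cause of the "cannot open readable.lua" error is that pandoc was started from a directory other than the one holding the script, since a bare file name is typically looked up relative to the current directory. A minimal sketch, assuming that is the problem: the hypothetical helper below turns the script location into an absolute path, fails early if the file is missing, and builds a pandoc command line. The helper name and the markdown output format are illustrative choices, not part of pandoc.

```python
from pathlib import Path
from typing import List


def pandoc_reader_args(script: str, source: str,
                       to: str = "markdown") -> List[str]:
    """Build a pandoc argv that names the custom reader by absolute path."""
    reader = Path(script).expanduser().resolve()
    if not reader.is_file():
        # Report the full path that was tried, which a bare name hides.
        raise FileNotFoundError(f"custom reader not found: {reader}")
    return ["pandoc", "--from", str(reader), "--to", to, source]
```

The returned list can be passed to `subprocess.run`, or joined and pasted into a terminal; either way the reader is found regardless of the directory pandoc is run from.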

-- 
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/5d9fa569-19d0-490b-88cf-1fb5fe73a400n%40googlegroups.com.


* Re: New custom reader for extracting content from web pages
From: Z T @ 2022-04-10  6:58 UTC (permalink / raw)
  To: pandoc-discuss



I'm in a similar position to the previous poster. This reader looks
great, but despite having read as much documentation as I could on Lua
filters, I haven't been able to get it to work.

Is it possible to get this to work just by copying the script, or is it
necessary to run 'npm install -g readability-cli'? Are there other
prerequisites for using Lua filters? I've seen
https://pandoc.org/lua-filters.html#lua-interpreter-initialization but
I'm not sure I understand what the requirements are.
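
As the announcement says, the reader runs the 'readability-cli' program, so that program has to be installed alongside pandoc; copying the Lua script alone would not provide it. A small sketch for checking this from Python, assuming the npm package installs an executable named `readable` (that name is an assumption here, as are the function and table names):

```python
import shutil
from typing import Callable, Dict, List, Optional

# Executable name -> hint on how to obtain it (names are assumptions).
INSTALL_HINTS: Dict[str, str] = {
    "pandoc": "see https://pandoc.org/installing.html",
    "readable": "npm install -g readability-cli",
}


def missing_tools(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the required executables that cannot be found on PATH."""
    return [name for name in INSTALL_HINTS if which(name) is None]
```

Looping over `missing_tools()` and printing `INSTALL_HINTS[name]` for each entry gives a quick report of what still needs installing; an empty list means both programs were found.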



