* New custom reader for extracting content from web pages
From: John MacFarlane @ 2022-01-16 18:58 UTC
To: pandoc-discuss

I've added a new example of a custom reader, which runs
the 'readability-cli' program on HTML input before processing
it with pandoc, extracting the content and omitting navigation
and layout.

See

https://pandoc.org/custom-readers.html#example-extracting-the-content-from-web-pages

This shows how the new custom reader interface, when combined
with pandoc.read in the Lua API, can be used to add
preprocessors.

(Of course, you could do something similar in a shell script.
But doing it this way ensures that pandoc will be able to
retrieve resources (e.g. images) from the URL. In addition,
the filter does some further processing to remove structural
Divs that clutter the output, and it is easily customizable.)
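The external-script alternative mentioned in the announcement could be sketched roughly as follows, in Python rather than shell. This is not the Lua reader from the pandoc documentation. It is a minimal sketch which assumes that the readability-cli package provides an executable named `readable` that accepts a URL and prints the extracted article HTML on standard output, and that `pandoc` is on the PATH. The function names are hypothetical. As the announcement points out, a pipeline like this hands pandoc only the extracted HTML and not the URL, so resources such as images may not be retrievable.

```python
import subprocess


def readable_command(url: str) -> list:
    # Assumption: `readable URL` writes the extracted article HTML to stdout.
    return ["readable", url]


def pandoc_command(output_format: str = "markdown") -> list:
    # pandoc reads HTML from stdin and writes the requested format to stdout.
    return ["pandoc", "--from", "html", "--to", output_format]


def web_page_to_text(url: str, output_format: str = "markdown") -> str:
    """Extract the main content of a web page, then convert it with pandoc."""
    # Step 1: strip navigation and layout with the external `readable` tool.
    article = subprocess.run(
        readable_command(url), capture_output=True, text=True, check=True
    )
    # Step 2: feed the extracted HTML to pandoc's built-in HTML reader.
    converted = subprocess.run(
        pandoc_command(output_format),
        input=article.stdout,
        capture_output=True,
        text=True,
        check=True,
    )
    return converted.stdout
```

With both tools installed, `web_page_to_text("https://example.com/article")` would return the converted text. `check=True` makes a failure in either tool raise an exception instead of silently producing empty output.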
* Re: New custom reader for extracting content from web pages
From: Michael Love @ 2022-01-22 22:02 UTC
To: pandoc-discuss

Tried the suggestion from your private message (please excuse my shyness).
Found out that this is well beyond my abilities.

Got node installed. Installed readability-cli. Discovered that I had to
have an init.lua file. Found one and saved it in the directory holding
pandoc. Copied your script and saved it as readable.lua, in the same
directory. Discovered that I don't seem to know where init.lua and
readable.lua need to go. So far, when I run "pandoc -f readable.lua
[et cetera]", the response is:

    error running Lua:
    cannot open readable.lua: No such file or directory

All this was on macOS. Tried the same on Windows; similar result.

My weariness convinces me that I am Well Out of My Depth. Don't seem to
know enough about Node, Lua, or whatever pandoc is written in. Learned
some things about Node and Lua. Success is over an horizon too far.
Someday.

Thanks anyway.
Pandoc is still wonderful.
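The "cannot open readable.lua" error above suggests that the script was not found where pandoc looked for it. A minimal sketch of that lookup, assuming that a relative script name passed to -f is resolved against the current working directory (the folder the command is run from) rather than the folder that holds the pandoc executable; the helper names are hypothetical:

```python
from pathlib import Path


def resolve_reader_script(name: str, cwd: Path) -> Path:
    """Return the path a script name refers to when used from `cwd`."""
    path = Path(name)
    # Absolute paths are used as given; relative ones depend on cwd.
    return path if path.is_absolute() else cwd / path


def explain(name: str, cwd: Path) -> str:
    """Describe whether the script would be found from `cwd`."""
    target = resolve_reader_script(name, cwd)
    if target.is_file():
        return f"found {target}"
    return (
        f"missing {target}: run pandoc from the folder that holds "
        f"{name}, or pass the script's full path to -f"
    )
```

For example, `print(explain("readable.lua", Path.cwd()))` reports whether the script is visible from the folder where the pandoc command is being typed.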
* Re: New custom reader for extracting content from web pages
From: Z T @ 2022-04-10 6:58 UTC
To: pandoc-discuss

I'm in a similar position to the previous poster. This reader looks
great, but despite having read as much documentation as I could on Lua
filters, I haven't been able to get it to work.

Is it possible to get this to work just by copying the script, or is it
necessary to run 'npm install -g readability-cli'? Are there other
prerequisites for using Lua filters? I've seen
https://pandoc.org/lua-filters.html#lua-interpreter-initialization
but I'm not sure I understand what the requirements are.
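Since the announcement says the reader runs the 'readability-cli' program, that program has to be installed in addition to the script being copied. Pandoc embeds its own Lua interpreter, so a separate Lua installation should not be needed. A minimal sketch of a prerequisite check, assuming the readability-cli package installs an executable named `readable`; the helper name is hypothetical:

```python
import shutil


def check_prerequisites(tools=("pandoc", "readable")) -> dict:
    """Map each required executable to its location on PATH, or None."""
    # shutil.which returns None when the executable cannot be found.
    return {tool: shutil.which(tool) for tool in tools}


for tool, location in check_prerequisites().items():
    status = location if location else "NOT FOUND on PATH"
    print(f"{tool}: {status}")
```

Any tool reported as not found would need to be installed, or its folder added to PATH, before the custom reader can call it.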