From mboxrd@z Thu Jan 1 00:00:00 1970
From: Z T
Newsgroups: gmane.text.pandoc
Subject: Re: New custom reader for extracting content from web pages
Date: Sat, 9 Apr 2022 23:58:32 -0700 (PDT)
Message-ID: <1ae43988-8d34-4bdf-ba11-f875d8f69943n@googlegroups.com>
References: <5d9fa569-19d0-490b-88cf-1fb5fe73a400n@googlegroups.com>
In-Reply-To: <5d9fa569-19d0-490b-88cf-1fb5fe73a400n-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
Reply-To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org
To: pandoc-discuss
Xref: news.gmane.io gmane.text.pandoc:30426
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="----=_Part_4056_2037753708.1649573912917"

------=_Part_4056_2037753708.1649573912917
Content-Type: multipart/alternative; boundary="----=_Part_4057_1410513936.1649573912917"

------=_Part_4057_1410513936.1649573912917
Content-Type: text/plain; charset="UTF-8"

I'm in a similar position as the last response. This reader looks great, but despite having read as much documentation as I could on lua filters, I haven't been able to get this to work.

Is it possible to get this to work with copying the script or is it necessary to use 'npm install -g readability-cli'? Are there other prerequisites to using lua filters? I've seen https://pandoc.org/lua-filters.html#lua-interpreter-initialization but not sure I understand what the requirements are.

On Sunday, January 23, 2022 at 9:02:05 AM UTC+11 myke...-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org wrote:

> Tried your suggestion from private (please excuse my shyness) message.
> Found out that this is well beyond my abilities. Got node installed.
> Installed readability-cli. Discovered that I had to have an init.lua
> file. Found one and saved it in directory holding pandoc. Copied your
> script and saved it as readable.lua, in the same directory. Discovered
> that I don't seem to know where init.lua and readable.lua need to go. So
> far, when I run "pandoc -f readable.lua [et cetera]", the response is:
> "error running Lua: [new line] cannot open readable.lua: No such file or
> directory." All this was on MacOS. Tried same on MSFT WinOS; similar
> result.
> My weariness convinces me that I am Well Out of My Depth. Don't seem to
> know enough about Node, Lua, or whatever pandoc is written in. Learned
> some things about Node and Lua. Success is over an horizon too far.
> Someday.
> Thanks anyway.
> Pandoc is still wonderful.
>
>
> On Sunday, January 16, 2022 at 1:59:14 PM UTC-5 John MacFarlane wrote:
>
>>
>> I've added a new example of a custom reader, which runs
>> the 'readability-cli' program on HTML input before processing
>> it with pandoc, extracting the content and omitting navigation
>> and layout.
>>
>> See
>>
>> https://pandoc.org/custom-readers.html#example-extracting-the-content-from-web-pages
>>
>> This shows how the new custom reader interface, when combined
>> with pandoc.read in the Lua API, can be used to add
>> preprocessors.
>>
>> (Of course, you could do something similar in a shell script.
>> But doing it this way ensures that pandoc will be able to
>> retrieve resources (e.g. images) from the URL. In addition,
>> the filter does some further processing to remove structural
>> Divs that clutter the output, and it is easily customizable.)
>>
>

--
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/1ae43988-8d34-4bdf-ba11-f875d8f69943n%40googlegroups.com.
------=_Part_4057_1410513936.1649573912917--
------=_Part_4056_2037753708.1649573912917--