From: Michael Love
Newsgroups: gmane.text.pandoc
Subject: Re: New custom reader for extracting content from web pages
Date: Sat, 22 Jan 2022 14:02:05 -0800 (PST)
Message-ID: <5d9fa569-19d0-490b-88cf-1fb5fe73a400n@googlegroups.com>
To: pandoc-discuss

Tried the suggestion from your private message (please excuse my
shyness). Found out that this is well beyond my abilities. Got Node
installed. Installed readability-cli. Discovered that I had to have an
init.lua file. Found one and saved it in the directory holding pandoc.
Copied your script and saved it as readable.lua in the same directory.
Discovered that I don't seem to know where init.lua and readable.lua
need to go. So far, when I run "pandoc -f readable.lua [et cetera]",
the response is: "error running Lua: [new line] cannot open
readable.lua: No such file or directory." All this was on macOS. Tried
the same on Microsoft Windows; similar result.

My weariness convinces me that I am Well Out of My Depth. Don't seem to
know enough about Node, Lua, or whatever pandoc is written in. Learned
some things about Node and Lua. Success is over a horizon too far.
Someday.

Thanks anyway.

Pandoc is still wonderful.
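
One possible cause of the "cannot open readable.lua" error above, offered as an assumption rather than a diagnosis, is that a bare file name given to -f is looked up relative to the directory the command is run from, not the directory holding the pandoc program. A minimal Python sketch of a workaround follows: it turns the script location into an absolute path and refuses to continue if the file is not there. The helper name build_pandoc_command and the input and output names are hypothetical; only the pandoc options -f and -o are taken as given.

```python
from pathlib import Path


def build_pandoc_command(reader_script, source, output):
    """Build a pandoc argument list that names the custom reader by absolute path.

    Hypothetical helper: it only assembles the command, it does not run pandoc.
    Raises FileNotFoundError when the script is not at the given location.
    """
    # Expand "~" and make the path absolute, so the result no longer
    # depends on the current working directory.
    script = Path(reader_script).expanduser().resolve()
    if not script.is_file():
        raise FileNotFoundError(f"custom reader not found: {script}")
    return ["pandoc", "-f", str(script), source, "-o", output]
```

The returned list can be handed to subprocess.run, or joined and typed at the prompt; the same effect is obtained by hand by changing into the directory that holds readable.lua before running pandoc, or by spelling out the full path after -f.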
On Sunday, January 16, 2022 at 1:59:14 PM UTC-5 John MacFarlane wrote:

> I've added a new example of a custom reader, which runs
> the 'readability-cli' program on HTML input before processing
> it with pandoc, extracting the content and omitting navigation
> and layout.
>
> See
>
> https://pandoc.org/custom-readers.html#example-extracting-the-content-from-web-pages
>
> This shows how the new custom reader interface, when combined
> with pandoc.read in the Lua API, can be used to add
> preprocessors.
>
> (Of course, you could do something similar in a shell script.
> But doing it this way ensures that pandoc will be able to
> retrieve resources (e.g. images) from the URL. In addition,
> the filter does some further processing to remove structural
> Divs that clutter the output, and it is easily customizable.)