From: John MacFarlane
Newsgroups: gmane.text.pandoc
Subject: New custom reader for extracting content from web pages
Date: Sun, 16 Jan 2022 10:58:57 -0800

I've added a new example of a custom reader, which runs the 'readability-cli' program on HTML input before processing it with pandoc, extracting the content and omitting navigation and layout. See

https://pandoc.org/custom-readers.html#example-extracting-the-content-from-web-pages

This shows how the new custom reader interface, when combined with pandoc.read in the Lua API, can be used to add preprocessors. (Of course, you could do something similar in a shell script. But doing it this way ensures that pandoc will be able to retrieve resources (e.g. images) from the URL. In addition, the filter does some further processing to remove structural Divs that clutter the output, and it is easily customizable.)
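
For readers who want to see the shape of such a preprocessor, here is a minimal sketch. It is not the example from the linked page. It assumes that an executable named "readability" is on your PATH, that it reads HTML on stdin and writes simplified HTML to stdout, and that it needs no arguments; check your installation for the actual command name and flags. The file name readable.lua, the layout_classes table, and the helper names are hypothetical, chosen only for illustration.

```lua
-- readable.lua: sketch of a custom reader that preprocesses HTML input.

-- ASSUMPTION: executable name and (empty) argument list; adjust both to
-- match whatever your readability-cli installation provides.
local PREPROCESSOR = "readability"
local PREPROCESSOR_ARGS = {}

-- Hypothetical set of class names treated as layout-only.
local layout_classes = { row = true, page = true, container = true }

-- True if a Div has no identifier, no attributes, and no classes other
-- than the layout-only ones listed above.
local function is_layout_div(el)
  if el.identifier ~= "" or #el.attributes > 0 then
    return false
  end
  for _, class in ipairs(el.classes) do
    if not layout_classes[class] then
      return false
    end
  end
  return true
end

-- Filter function: replace a layout-only Div with its contents.
-- Returning nil leaves the element unchanged.
local function unwrap_layout_div(el)
  if is_layout_div(el) then
    return el.content
  end
  return nil
end

-- Entry point called by pandoc when this file is given to -f.
function Reader(input, opts)
  -- Step 1: pipe the raw input through the external preprocessor.
  local html = pandoc.pipe(PREPROCESSOR, PREPROCESSOR_ARGS, tostring(input))
  -- Step 2: hand the cleaned HTML to pandoc's own HTML reader.
  local doc = pandoc.read(html, "html", opts)
  -- Step 3: drop Divs that only carried layout information.
  return doc:walk { Div = unwrap_layout_div }
end
```

Saved under that (hypothetical) name, it would be invoked along the lines of `pandoc -f readable.lua -t markdown URL`. To customize it, edit the layout_classes table or swap in a different preprocessor command.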