From: "Noam Preil"
Date: Thu, 05 Aug 2021 18:56:16 -0400
Subject: Re: [9front] PDF search bounty
Reply-To: 9front@9front.org
List-ID: <9front.9front.org>

Update: PDF to text conversion now uses a proper rendering system and
understands character sets and encodings. It no longer relies on
heuristics to determine spacing.

The pdffs repository[1] now also includes a sample pdfpages.rc script,
which you can adapt to your needs for PDF searching. The default variant
searches for a text string in a PDF and prints out every matching page.
Patching it to instead invoke page on the first (or even nth) match is
trivial.

There is also a pdf2txt script, which converts a full document to text,
dumping the result to stdout (and warnings to stderr). At present this
is the correct way to convert a full document, as the object model is
undergoing work to fix refcounting so that the memory associated with
old pages can be released correctly.

Note that, by default, operators which aren't supported yet (which is
most graphics) cause pdffs to exit immediately. There is a patch[2]
that disables this behavior. If you apply it, you'll want to redirect
stderr, as pdffs will otherwise spam the output with warnings.

- Noam Preil

[1]: https://git.sr.ht/~ft/pdffs
[2]: https://pixelhero.dev/patches/pdffs_ignore.patch
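
For anyone wondering what "print every matching page" versus "jump to
the nth match" amounts to, here is a rough sketch in Python. It is not
taken from pdfpages.rc and does not touch pdffs or its file layout. It
assumes the text of each page has already been extracted into a list of
strings, one per page. The names matching_pages and nth_match are made
up for illustration.

```python
from typing import Iterable, List, Optional


def matching_pages(pages: Iterable[str], needle: str) -> List[int]:
    """Return the 1-based numbers of all pages whose text contains needle."""
    return [n for n, text in enumerate(pages, start=1) if needle in text]


def nth_match(pages: Iterable[str], needle: str, n: int = 1) -> Optional[int]:
    """Return the page number of the nth match (1-based), or None if there are fewer than n matches."""
    hits = matching_pages(pages, needle)
    return hits[n - 1] if 1 <= n <= len(hits) else None


if __name__ == "__main__":
    # Hypothetical page texts standing in for the output of a PDF-to-text step.
    sample = ["intro to 9p", "walk and open", "9p clunk semantics"]
    print(matching_pages(sample, "9p"))  # every matching page
    print(nth_match(sample, "9p", 2))    # page to open for the second match
```

The only difference between the two behaviours is whether you act on the
whole list of hits or on a single element of it, which is why the change
to the script is small.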