From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on inbox.vuxu.org
X-Spam-Level: 
X-Spam-Status: No, score=0.2 required=5.0 tests=DKIM_INVALID,DKIM_SIGNED
	autolearn=no autolearn_force=no version=3.4.4
Received: (qmail 25814 invoked from network); 5 Aug 2021 22:58:06 -0000
Received: from 1ess.inri.net (216.126.196.35)
  by inbox.vuxu.org with ESMTPUTF8; 5 Aug 2021 22:58:06 -0000
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 8bit
Received: from out1.migadu.com ([91.121.223.63]) by 1ess; Thu Aug  5 18:52:21 -0400 2021
Content-Transfer-Encoding: quoted-printable
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=pixelhero.dev;
	s=key1; t=1628203608;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding:  in-reply-to:in-reply-to;
	bh=t6TQXC2GJtP9wuXm9RrTikBcuMR6wVzFL1J+5D/122o=;
	b=GtdPRxfYQPw1tuwI8DIjpUjI0Uokfg5SLFZg4cIey3HirRe7upWTsYGVxxi2Mct5SaP/6a
	erd3NRFk5tLJ596Jfaxl+s5u42qZVLQajmr/wgo5trwsMi/k0fun+XVD9Uah2uIcsiN59M
	CnU/+wYc8Ik4focsgWX3Mte9o7zdWDSvknLQiZcKV4Be/54StrKu9ITzd0XKRnAlW0WQ/B
	aXtEVXh00DaxgBUOFIVWEQHtE8OZLrA4MNYmecf14uzgYUiWkaQriAaFHcxhq5gfzHAdkw
	lysv+lOlos+6oSBAVB17wjix04rTMoyHsKviT8eiWLBdMi2eectDZ9VPu2RQ9Q==
Content-Type: text/plain; charset=UTF-8
To: <9front@9front.org>, "binary cat" <9front@9front.org>
X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers.
From: "Noam Preil" <noam@pixelhero.dev>
Date: Thu, 05 Aug 2021 18:56:16 -0400
Message-Id: <CDBY7KPVPDRR.CD8EOKV1K2ZK@pixelpc>
In-Reply-To: <5627a7f6-a3dc-41e5-9a4e-65a188dcf717@pixelhero.dev>
X-Migadu-Flow: FLOW_OUT
X-Migadu-Auth-User: noam@pixelhero.dev
List-ID: <9front.9front.org>
List-Help: <http://lists.9front.org>
X-Glyph: ➈
X-Bullshit: rich-client shader 
Subject: Re: [9front] PDF search bounty
Reply-To: 9front@9front.org
Precedence: bulk

Update:

PDF to text conversion now uses a proper rendering system, and
understands character sets and encodings, no longer relying on
heuristics to determine spacing.

The pdffs repository[1] now additionally includes a sample pdfpages.rc
script which you can adapt to your needs for PDF searching. The default
variant searches for a text string in a PDF and prints out every
matching page. Patching it to instead invoke page on the first (or even
nth) match is trivial.
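
As a rough illustration of the "every matching page" versus "nth match"
distinction, here is a minimal Python sketch. It is not pdfpages.rc and
does not touch pdffs. It assumes the per-page text has already been
extracted into a list of strings, and the names matching_pages and
nth_match are made up for this example.

```python
def matching_pages(pages, needle):
    """Return the 1-based numbers of all pages whose text contains needle."""
    return [n for n, text in enumerate(pages, start=1) if needle in text]


def nth_match(pages, needle, n=1):
    """Return the page number of the nth match (1-based), or None if absent."""
    hits = matching_pages(pages, needle)
    return hits[n - 1] if 0 < n <= len(hits) else None


if __name__ == "__main__":
    # Hypothetical page texts standing in for converted PDF pages.
    pages = ["intro text", "venti and fossil", "more on venti"]
    for number in matching_pages(pages, "venti"):
        print(number)
```

Moving from "print every match" to "open the first match" then comes down
to taking one element of the match list instead of looping over it.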

There is also a pdf2txt script, which converts a full document to text,
dumping the result to stdout (and warnings to stderr). For now, this is
the correct way to convert a full document, as the object model is
undergoing some work to fix refcounting so that we can release the
memory associated with old pages correctly.

Note that, by default, operators which aren't supported yet - which is
most graphics - cause pdffs to exit immediately. There is a patch[2]
that disables this behavior. With the patch applied, you'll want to
redirect stderr, as the warnings will otherwise spam the output.
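
The idea of keeping converted text and warnings apart can be sketched in
Python as below. This is a generic illustration, not part of pdffs: the
helper name run_split is made up, and the child process is a stand-in
that writes one line to each stream, since the exact pdf2txt invocation
is not given here.

```python
import subprocess
import sys


def run_split(cmd):
    """Run cmd and return (stdout, stderr) as two separate strings."""
    done = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return done.stdout, done.stderr


if __name__ == "__main__":
    # Stand-in for a converter: text goes to stdout, a warning to stderr.
    child = [
        sys.executable,
        "-c",
        "import sys; print('page text'); "
        "print('warning: unsupported operator', file=sys.stderr)",
    ]
    text, warnings = run_split(child)
    sys.stdout.write(text)  # only the converted text reaches stdout
    # warnings are held separately and can be logged or dropped
```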

- Noam Preil

[1]: https://git.sr.ht/~ft/pdffs
[2]: https://pixelhero.dev/patches/pdffs_ignore.patch