9front - general discussion about 9front
From: "Noam Preil" <noam@pixelhero.dev>
To: <9front@9front.org>, "binary cat" <9front@9front.org>
Subject: Re: [9front] PDF search bounty
Date: Thu, 05 Aug 2021 18:56:16 -0400
Message-ID: <CDBY7KPVPDRR.CD8EOKV1K2ZK@pixelpc>
In-Reply-To: <5627a7f6-a3dc-41e5-9a4e-65a188dcf717@pixelhero.dev>

Update:

PDF to text conversion now uses a proper rendering system, and
understands character sets and encodings, no longer relying on
heuristics to determine spacing.

The pdffs repository[1] now additionally includes a sample pdfpages.rc
script which you can adapt to your needs for PDF searching. The default
variant searches for a text string in a PDF and prints out every
matching page. Patching it to instead invoke page on the first (or even
nth) match is trivial.
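
As a rough illustration of that search logic (this is not the actual
pdfpages.rc, which is an rc script), here is a minimal Python sketch. It
assumes each page's text has already been extracted into a list of
strings; the names matching_pages and nth_match are hypothetical.

```python
from typing import List, Optional


def matching_pages(pages: List[str], needle: str) -> List[int]:
    """Return the 1-based numbers of all pages whose text contains needle."""
    return [n for n, text in enumerate(pages, start=1) if needle in text]


def nth_match(pages: List[str], needle: str, n: int = 1) -> Optional[int]:
    """Return the page number of the nth match (1-based), or None if absent."""
    hits = matching_pages(pages, needle)
    return hits[n - 1] if 1 <= n <= len(hits) else None


if __name__ == "__main__":
    # Stand-in data; in practice each string would be one page's extracted text.
    sample = ["intro text", "the bounty is paid", "appendix", "bounty terms"]
    print(matching_pages(sample, "bounty"))
    print(nth_match(sample, "bounty", 2))
```

Returning a page number rather than the page text keeps the "open the
nth match in a viewer" variant a one-line change on the caller's side.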

There is also a pdf2txt script, which converts a full document to text,
dumping the result to stdout (and warnings to stderr). At present, this
is the correct way to convert a full document, as the object model is
undergoing some work to fix refcounting so that we can release the
memory associated with old pages correctly.

Note that, by default, operators which aren't supported yet - which is
most graphics operators - cause pdffs to exit immediately. There is a
patch[2] that disables this behavior. If you apply it, you'll want to
redirect stderr, as the warnings will otherwise spam the output.
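
To show what keeping the two streams apart looks like when driving a
converter from another program, here is a small Python sketch. The
run_split helper is hypothetical, and the child process is a stand-in
that writes text to stdout and a warning to stderr; the real pdf2txt
command line is not shown here, so substitute it yourself.

```python
import subprocess
import sys
from typing import List, Tuple


def run_split(cmd: List[str]) -> Tuple[str, str]:
    """Run cmd and return (stdout, stderr) as two separate strings."""
    proc = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return proc.stdout, proc.stderr


if __name__ == "__main__":
    # Stand-in child: prints converted text to stdout, a warning to stderr.
    child = [
        sys.executable,
        "-c",
        "import sys; print('page text'); "
        "print('warning: unsupported operator', file=sys.stderr)",
    ]
    text, warnings = run_split(child)
    sys.stdout.write(text)  # only the converted text reaches our stdout
    # warnings is kept separately; log or discard it as needed
```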

- Noam Preil

[1]: https://git.sr.ht/~ft/pdffs
[2]: https://pixelhero.dev/patches/pdffs_ignore.patch


