What is the state of the $200 bounty on searching through PDFs? I thought I might give it a shot.
On May 30, 2021 4:10:56 PM EDT, binary cat <dogedoge61@gmail.com> wrote:
>What is the state of the $200 bounty on searching through PDFs?
>I thought I might give it a shot.
>
i'm not aware of anyone having done any work on this.
sl
https://git.sr.ht/~ft/pdffs
it did not get accepted by p9f as a gsoc project, but noam was (is) going to work on it in any case, as far as I understand. patches welcome.
What is included in Sigrid's attempted pdffs? Perhaps that would include search.
On May 30, 2021 10:59:04 PM UTC, Stanley Lieber <sl@stanleylieber.com> wrote:
>On May 30, 2021 4:10:56 PM EDT, binary cat <dogedoge61@gmail.com>
>wrote:
>>What is the state of the $200 bounty on searching through PDFs?
>>I thought I might give it a shot.
>>
>
>i'm not aware of anyone having done any work on this.
>
>sl
Hi, I'm working on this :) - Noam Preil
Quoth Romano <unobe@cpan.org>:
> What is included in Sigrid's attempted pdffs? Perhaps that would include search.
>
> On May 30, 2021 10:59:04 PM UTC, Stanley Lieber <sl@stanleylieber.com> wrote:
> >On May 30, 2021 4:10:56 PM EDT, binary cat <dogedoge61@gmail.com>
> >wrote:
> >>What is the state of the $200 bounty on searching through PDFs?
> >>I thought I might give it a shot.
> >>
> >
> >i'm not aware of anyone having done any work on this.
> >
> >sl
Mostly just object extraction: text, images, etc., plus unpacking
(gzip, lzw and so on). The part that is required for pdf2text is not
there yet, but is allegedly not too complex to implement. Page contents
usually are a bunch of drawing operations that also include parts of
text being placed in specific locations (defined by coordinates X and
Y) on the page. Search was definitely part of the plan.
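To illustrate why positioned text needs some reassembly before it can be searched, here is a minimal sketch in Python. It is not pdffs code. It assumes the drawing operations have already been reduced to hypothetical (x, y, text) fragments with Y growing upward, and the `y_tol` tolerance is an arbitrary illustrative value.

```python
from typing import List, Tuple

# Hypothetical fragment type: (x, y, text), with y growing upward
# (bottom-left origin). This layout is an assumption for illustration.
Fragment = Tuple[float, float, str]


def fragments_to_lines(frags: List[Fragment], y_tol: float = 2.0) -> List[str]:
    """Group fragments with nearly equal Y into lines, read left to right."""
    lines: List[Tuple[float, List[Fragment]]] = []
    # Larger Y is higher on the page, so visit those fragments first.
    for frag in sorted(frags, key=lambda f: -f[1]):
        if lines and abs(lines[-1][0] - frag[1]) <= y_tol:
            # Close enough to the current line's first fragment: same line.
            lines[-1][1].append(frag)
        else:
            # Otherwise start a new line anchored at this fragment's Y.
            lines.append((frag[1], [frag]))
    # Within each line, order fragments by X and join them with spaces.
    return [
        " ".join(text for _, _, text in sorted(group, key=lambda f: f[0]))
        for _, group in lines
    ]
```

Joining with a single space is the crudest possible spacing rule; a real extractor would need glyph widths to decide whether two fragments belong to the same word.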
Further development stalled on the assumption that it might get
accepted as a GSOC project. Since that did not happen, I will
continue as I have free time (and will) for this. Noam might do that
too.
Quoth binary cat <dogedoge61@gmail.com>:
> What is the state of the $200 bounty on searching through PDFs?
> I thought I might give it a shot.
>
sigrid's started working on pdffs, and noam
has also claimed interest.
But, don't let that stop you -- collaboration
is good, and I'd be happy to give multiple $200
bounties out if multiple people were needed to
get it over the finish line.
For extracting text from pdfs, I found `/sys/lib/ghostscript/ps2ascii.gs` to be useful. Currently I just have an awk+rc script wrapped around it, which already does a decent job, using `plumb` to bring up the correct page. I modified `page` to accept x/y coordinate arguments, but it doesn't use the same scaling/origin as `ps2ascii.gs`, so that's giving me some trouble.
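One possible shape of the scaling/origin mismatch, offered as an assumption rather than a diagnosis: if one side reports positions in PostScript points (72 per inch, origin at the bottom-left) and the other expects pixels (origin at the top-left), the conversion could look like the Python sketch below. The function name, the `dpi` default, and the units on both sides are hypothetical; neither `ps2ascii.gs` nor `page` has been checked against them.

```python
from typing import Tuple


def points_to_pixels(
    x_pt: float, y_pt: float, page_height_pt: float, dpi: float = 100.0
) -> Tuple[int, int]:
    """Map a bottom-left-origin point position to top-left-origin pixels.

    Assumes 72 points per inch and that the page is rendered at `dpi`.
    """
    scale = dpi / 72.0
    x_px = round(x_pt * scale)
    # Flip the Y axis: distance from the top of the page, then scale.
    y_px = round((page_height_pt - y_pt) * scale)
    return x_px, y_px
```

For example, on a 792 pt tall page rendered at 144 dpi, a position one inch in from the left and one inch down from the top (72, 720) would map to pixel (144, 144) under these assumptions.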
Hey,
Added basic heuristics-based text extraction to pdffs!
http://pixelhero.dev/tmp/heuristics5.png
http://pixelhero.dev/tmp/search.png
Working on proper rendering soon :)
- Noam
Update: PDF to text conversion now uses a proper rendering system and understands character sets and encodings; it no longer relies on heuristics to determine spacing.

The pdffs repository[1] now also includes a sample pdfpages.rc script, which you can adapt to your needs for PDF searching. The default variant searches for a text string in a PDF and prints out every matching page. Patching it to instead invoke page on the first (or even nth) match is trivial.

There is also a pdf2txt script, which converts a full document to text, dumping the result to stdout (and warnings to stderr). At present this is the correct way to convert a full document, as the object model is undergoing some work to fix refcounting so that we can release the memory associated with old pages correctly.

Note that, by default, operators which aren't supported yet - which is most graphics - cause pdffs to exit immediately. There is a patch[2] that disables this behavior. Note that you'll want to redirect stderr, as it will spam the output with warnings.

- Noam Preil

[1]: https://git.sr.ht/~ft/pdffs
[2]: https://pixelhero.dev/patches/pdffs_ignore.patch
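The page search described above (print every matching page, or pick the first or nth match) can be sketched in a few lines of Python. This is not the pdfpages.rc script and does not use pdffs's file layout. It assumes the per-page text has already been extracted into a list of strings, and the case-insensitive match is an illustrative choice.

```python
from typing import Iterable, List, Optional


def matching_pages(pages: Iterable[str], needle: str) -> List[int]:
    """Return the 1-based numbers of pages whose text contains `needle`."""
    needle = needle.lower()  # case-insensitive match: an assumption
    return [n for n, text in enumerate(pages, start=1) if needle in text.lower()]


def nth_match(pages: Iterable[str], needle: str, n: int = 1) -> Optional[int]:
    """Return the page number of the nth match, or None if there is none."""
    hits = matching_pages(pages, needle)
    return hits[n - 1] if 0 < n <= len(hits) else None
```

The page number returned by the hypothetical `nth_match` is what a wrapper would hand to a viewer to open the document at that match.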