9front - general discussion about 9front
* [9front] PDF search bounty
From: binary cat @ 2021-05-30 20:10 UTC (permalink / raw)
  To: 9front

What is the state of the $200 bounty on searching through PDFs?
I thought I might give it a shot.


* Re: [9front] PDF search bounty
From: Stanley Lieber @ 2021-05-30 22:59 UTC (permalink / raw)
  To: 9front

On May 30, 2021 4:10:56 PM EDT, binary cat <dogedoge61@gmail.com> wrote:
>What is the state of the $200 bounty on searching through PDFs?
>I thought I might give it a shot.
>

i'm not aware of anyone having done any work on this.

sl


* Re: [9front] PDF search bounty
From: Sigrid Solveig Haflínudóttir @ 2021-05-30 23:09 UTC (permalink / raw)
  To: 9front, Stanley Lieber

https://git.sr.ht/~ft/pdffs

it did not get accepted by p9f as a gsoc project but noam was (is) going to work on it in any case, as far as I understand.

patches welcome.


* Re: [9front] PDF search bounty
From: Romano @ 2021-05-30 23:23 UTC (permalink / raw)
  To: 9front

What is included in Sigrid's attempted pdffs? Perhaps that would include search.

On May 30, 2021 10:59:04 PM UTC, Stanley Lieber <sl@stanleylieber.com> wrote:
>On May 30, 2021 4:10:56 PM EDT, binary cat <dogedoge61@gmail.com>
>wrote:
>>What is the state of the $200 bounty on searching through PDFs?
>>I thought I might give it a shot.
>>
>
>i'm not aware of anyone having done any work on this.
>
>sl


* Re: [9front] PDF search bounty
From: Noam Preil @ 2021-05-30 23:24 UTC (permalink / raw)
  To: Stanley Lieber

Hi,

I'm working on this :)

- Noam Preil


* Re: [9front] PDF search bounty
From: Sigrid Solveig Haflínudóttir @ 2021-05-30 23:33 UTC (permalink / raw)
  To: 9front

Quoth Romano <unobe@cpan.org>:
> What is included in Sigrid's attempted pdffs? Perhaps that would include search.
> 
> On May 30, 2021 10:59:04 PM UTC, Stanley Lieber <sl@stanleylieber.com> wrote:
> >On May 30, 2021 4:10:56 PM EDT, binary cat <dogedoge61@gmail.com>
> >wrote:
> >>What is the state of the $200 bounty on searching through PDFs?
> >>I thought I might give it a shot.
> >>
> >
> >i'm not aware of anyone having done any work on this.
> >
> >sl

Mostly just object extraction.  Text, images, etc.  Unpacking (gzip,
lzw and so on).  The part that is required for pdf2text is not there,
but is allegedly not too complex to implement.  Page contents usually
are a bunch of drawing operations that also include parts of text
being placed in specific locations (defined by coordinates X and Y) on
the page.  Search was definitely part of the plan.
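
To illustrate why positioned text needs some reassembly before it can
be searched, here is a minimal sketch, not taken from pdffs.  It
assumes the fragments have already been pulled out of the page's
drawing operations as (x, y, text) tuples, with y growing upwards; the
function name and the tolerance value are hypothetical.

```python
def fragments_to_lines(fragments, y_tolerance=2.0):
    """Group (x, y, text) fragments into lines of text.

    Fragments whose y coordinate is within y_tolerance of the first
    fragment of a line are treated as part of that line.
    """
    lines = []  # each entry is [y_of_first_fragment, [(x, text), ...]]
    # Visit fragments from the top of the page down (largest y first).
    for x, y, text in sorted(fragments, key=lambda f: -f[1]):
        if lines and abs(lines[-1][0] - y) <= y_tolerance:
            lines[-1][1].append((x, text))
        else:
            lines.append([y, [(x, text)]])
    # Within a line, order fragments left to right and join with spaces.
    return [" ".join(t for _, t in sorted(parts)) for _, parts in lines]


if __name__ == "__main__":
    sample = [(110, 700.5, "world"), (72, 700, "Hello"),
              (72, 686, "second"), (120, 686, "line")]
    for line in fragments_to_lines(sample):
        print(line)
```

Joining with a single space is a simplification; real documents need
font metrics to decide where word breaks actually fall.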

Further development had stalled on the assumption that it might get
accepted as a GSOC project.  Since that did not happen, I will
continue as I have free time (and will) for this.  Noam might do that
too.



* Re: [9front] PDF search bounty
From: ori @ 2021-05-30 23:38 UTC (permalink / raw)
  To: 9front

Quoth binary cat <dogedoge61@gmail.com>:
> What is the state of the $200 bounty on searching through PDFs?
> I thought I might give it a shot.
> 

sigrid's started working on pdffs, and noam
has also claimed interest.

But, don't let that stop you -- collaboration
is good, and I'd be happy to give multiple $200
bounties out if multiple people were needed to
get it over the finish line.



* Re: [9front] PDF search bounty
From: binary cat @ 2021-05-31 18:12 UTC (permalink / raw)
  To: 9front

For extracting text from PDFs, I found
`/sys/lib/ghostscript/ps2ascii.gs` to be useful.
Currently I just have an awk+rc script wrapped around it, which already
does a decent job and uses `plumb` to bring up the correct page.
I modified `page` to accept x/y coordinate arguments, but it doesn't
use the same scaling/origin as `ps2ascii.gs`, so that's giving me some
trouble.
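
A scaling/origin mismatch of this kind is usually a flip plus a scale.
The sketch below is only an illustration under assumptions: it
supposes the extractor reports positions with the origin at the
bottom-left of the page, and the viewer wants pixels with the origin
at the top-left.  Neither convention is confirmed for `ps2ascii.gs` or
`page`, and all names and parameters here are hypothetical.

```python
def extractor_to_viewer(x, y, page_height, dpi=100, units_per_inch=72,
                        flip_y=True):
    """Map a point from extractor space to viewer pixel space.

    page_height is given in extractor units.  With flip_y set, the
    y axis is inverted so that the origin moves from the bottom-left
    corner to the top-left corner.
    """
    scale = dpi / units_per_inch  # pixels per extractor unit
    if flip_y:
        y = page_height - y
    return (x * scale, y * scale)


if __name__ == "__main__":
    # A point one inch in from the left and one inch below the top of a
    # US Letter page (792 units tall at 72 units per inch).
    print(extractor_to_viewer(72, 720, 792, dpi=144))
```

If the extractor reports a different unit (for example tenths of a
point), only `units_per_inch` should need to change.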


* Re: [9front] PDF search bounty
From: Noam Preil @ 2021-06-01 22:29 UTC (permalink / raw)
  To: binary cat

Hey,

Added basic heuristics-based text extraction to pdffs!

http://pixelhero.dev/tmp/heuristics5.png 
http://pixelhero.dev/tmp/search.png

Working on proper rendering soon :)

- Noam


* Re: [9front] PDF search bounty
From: Noam Preil @ 2021-08-05 22:56 UTC (permalink / raw)
  To: 9front, binary cat

Update:

PDF to text conversion now uses a proper rendering system, and
understands character sets and encodings, no longer relying on
heuristics to determine spacing.

The pdffs repository[1] now additionally includes a sample pdfpages.rc
script which you can adapt to your needs for PDF searching. The default
variant searches for a text string in a PDF and prints out every
matching page. Patching it to instead invoke page on the first (or even
nth) match is trivial.
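
The page-matching idea can be sketched as follows.  This is not the
pdfpages.rc script itself, just an illustration in Python; it assumes
each page has already been converted to a string of text, and the
function name and the case-folding option are hypothetical.

```python
def matching_pages(pages, needle, ignore_case=True):
    """Return the 1-based numbers of the pages whose text contains needle."""
    if ignore_case:
        needle = needle.lower()
    hits = []
    for number, text in enumerate(pages, start=1):
        haystack = text.lower() if ignore_case else text
        if needle in haystack:
            hits.append(number)
    return hits


if __name__ == "__main__":
    pages = ["Plan 9 from Bell Labs", "The plumber", "Plumbing rules"]
    hits = matching_pages(pages, "plumb")
    print(hits)
    if hits:
        # The first (or nth) match is what a viewer would be opened on.
        print("first match on page", hits[0])
```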

There is also a pdf2txt script, which converts a full document to text,
dumping the result to stdout (and warnings to stderr). This is the
correct way to convert a full document, at present, as the object model
is currently undergoing some work to fix refcounting so that we can
release the memory associated with old pages correctly.

Note that, by default, operators which aren't supported yet (which is
most graphics operators) cause pdffs to exit immediately. There is a
patch[2] that disables this behavior. If you apply it, you'll want to
redirect stderr, as it will spam the output with warnings.

- Noam Preil

[1]: https://git.sr.ht/~ft/pdffs
[2]: https://pixelhero.dev/patches/pdffs_ignore.patch




Thread overview: 10+ messages
2021-05-30 20:10 [9front] PDF search bounty binary cat
2021-05-30 22:59 ` Stanley Lieber
2021-05-30 23:09   ` Sigrid Solveig Haflínudóttir
2021-05-30 23:23   ` Romano
2021-05-30 23:33     ` Sigrid Solveig Haflínudóttir
2021-05-30 23:24   ` Noam Preil
2021-05-30 23:38 ` ori
2021-05-31 18:12   ` binary cat
2021-06-01 22:29 ` Noam Preil
2021-08-05 22:56   ` Noam Preil
