9front - general discussion about 9front
* [9front] PDF search bounty
@ 2021-05-30 20:10 binary cat
  2021-05-30 22:59 ` Stanley Lieber
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: binary cat @ 2021-05-30 20:10 UTC (permalink / raw)
  To: 9front

What is the state of the $200 bounty on searching through PDFs?
I thought I might give it a shot.


* Re: [9front] PDF search bounty
  2021-05-30 20:10 [9front] PDF search bounty binary cat
@ 2021-05-30 22:59 ` Stanley Lieber
  2021-05-30 23:09   ` Sigrid Solveig Haflínudóttir
                     ` (2 more replies)
  2021-05-30 23:38 ` ori
  2021-06-01 22:29 ` Noam Preil
  2 siblings, 3 replies; 9+ messages in thread
From: Stanley Lieber @ 2021-05-30 22:59 UTC (permalink / raw)
  To: 9front

On May 30, 2021 4:10:56 PM EDT, binary cat <dogedoge61@gmail.com> wrote:
>What is the state of the $200 bounty on searching through PDFs?
>I thought I might give it a shot.
>

i'm not aware of anyone having done any work on this.

sl


* Re: [9front] PDF search bounty
  2021-05-30 22:59 ` Stanley Lieber
@ 2021-05-30 23:09   ` Sigrid Solveig Haflínudóttir
  2021-05-30 23:23   ` Romano
  2021-05-30 23:24   ` Noam Preil
  2 siblings, 0 replies; 9+ messages in thread
From: Sigrid Solveig Haflínudóttir @ 2021-05-30 23:09 UTC (permalink / raw)
  To: 9front, Stanley Lieber

https://git.sr.ht/~ft/pdffs

it did not get accepted by p9f as a gsoc project, but noam was (is) going to work on it in any case, as far as I understand.

patches welcome.


* Re: [9front] PDF search bounty
  2021-05-30 22:59 ` Stanley Lieber
  2021-05-30 23:09   ` Sigrid Solveig Haflínudóttir
@ 2021-05-30 23:23   ` Romano
  2021-05-30 23:33     ` Sigrid Solveig Haflínudóttir
  2021-05-30 23:24   ` Noam Preil
  2 siblings, 1 reply; 9+ messages in thread
From: Romano @ 2021-05-30 23:23 UTC (permalink / raw)
  To: 9front

What is included in Sigrid's attempted pdffs? Perhaps that would include search.

On May 30, 2021 10:59:04 PM UTC, Stanley Lieber <sl@stanleylieber.com> wrote:
>On May 30, 2021 4:10:56 PM EDT, binary cat <dogedoge61@gmail.com>
>wrote:
>>What is the state of the $200 bounty on searching through PDFs?
>>I thought I might give it a shot.
>>
>
>i'm not aware of anyone having done any work on this.
>
>sl


* Re: [9front] PDF search bounty
  2021-05-30 22:59 ` Stanley Lieber
  2021-05-30 23:09   ` Sigrid Solveig Haflínudóttir
  2021-05-30 23:23   ` Romano
@ 2021-05-30 23:24   ` Noam Preil
  2 siblings, 0 replies; 9+ messages in thread
From: Noam Preil @ 2021-05-30 23:24 UTC (permalink / raw)
  To: Stanley Lieber

Hi,

I'm working on this :)

- Noam Preil


* Re: [9front] PDF search bounty
  2021-05-30 23:23   ` Romano
@ 2021-05-30 23:33     ` Sigrid Solveig Haflínudóttir
  0 siblings, 0 replies; 9+ messages in thread
From: Sigrid Solveig Haflínudóttir @ 2021-05-30 23:33 UTC (permalink / raw)
  To: 9front

Quoth Romano <unobe@cpan.org>:
> What is included in Sigrid's attempted pdffs? Perhaps that would include search.
> 
> On May 30, 2021 10:59:04 PM UTC, Stanley Lieber <sl@stanleylieber.com> wrote:
> >On May 30, 2021 4:10:56 PM EDT, binary cat <dogedoge61@gmail.com>
> >wrote:
> >>What is the state of the $200 bounty on searching through PDFs?
> >>I thought I might give it a shot.
> >>
> >
> >i'm not aware of anyone having done any work on this.
> >
> >sl

Mostly just object extraction: text, images, etc., plus unpacking
(gzip, lzw and so on).  The part that is required for pdf2text is not
there yet, but is allegedly not too complex to implement.  Page
contents are usually a sequence of drawing operations, some of which
place pieces of text at specific locations (given by X and Y
coordinates) on the page.  Search was definitely part of the plan.
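
To illustrate the point about page contents, here is a minimal Python
sketch (not pdffs code) that walks an already-decompressed content
stream and records where each piece of text is placed.  It assumes
only the BT, Td, Tm and Tj operators matter, that strings have no
escapes or nested parentheses, and that the text matrix does no
scaling or rotation.  The names extract_text and search are
hypothetical.

```python
import re

# One token is a literal string "(...)", a /Name, a number, or an operator.
# Assumption: no escapes or nested parentheses inside strings, no hex
# strings, and no TJ arrays. Real content streams need a fuller tokenizer.
TOKEN = re.compile(
    r"\((?P<str>[^()\\]*)\)"
    r"|(?P<name>/[^\s()<>\[\]/]*)"
    r"|(?P<num>[-+]?(?:\d+\.?\d*|\.\d+))"
    r"|(?P<op>[A-Za-z'\"*]+)"
)

def extract_text(stream):
    """Return a list of (x, y, text) for every Tj operator in the stream."""
    x = y = 0.0
    operands = []   # operands seen since the last operator
    found = []
    for m in TOKEN.finditer(stream):
        if m.group("str") is not None:
            operands.append(m.group("str"))
        elif m.group("num") is not None:
            operands.append(float(m.group("num")))
        elif m.group("name") is not None:
            continue  # font names and the like are ignored here
        else:
            op = m.group("op")
            if op == "BT":                          # begin text object
                x = y = 0.0
            elif op == "Td" and len(operands) >= 2:  # relative move
                x += operands[-2]
                y += operands[-1]
            elif op == "Tm" and len(operands) >= 6:  # absolute matrix; keep e, f
                x, y = operands[-2], operands[-1]
            elif op == "Tj" and operands and isinstance(operands[-1], str):
                found.append((x, y, operands[-1]))
            operands = []
    return found

def search(stream, needle):
    """Return the (x, y, text) fragments whose text contains needle."""
    return [frag for frag in extract_text(stream) if needle in frag[2]]
```

For a stream such as "BT /F1 12 Tf 72 720 Td (Hello) Tj 0 -14 Td
(world) Tj ET", search(stream, "world") yields the fragment placed at
x=72, y=706.  A real implementation would also have to track glyph
widths, font encodings and strings split across several operators.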

Further development stalled on the assumption that it might get
accepted as a GSOC project.  Since that did not happen, I will
continue as I have free time (and will) for this.  Noam might do that
too.



* Re: [9front] PDF search bounty
  2021-05-30 20:10 [9front] PDF search bounty binary cat
  2021-05-30 22:59 ` Stanley Lieber
@ 2021-05-30 23:38 ` ori
  2021-05-31 18:12   ` binary cat
  2021-06-01 22:29 ` Noam Preil
  2 siblings, 1 reply; 9+ messages in thread
From: ori @ 2021-05-30 23:38 UTC (permalink / raw)
  To: 9front

Quoth binary cat <dogedoge61@gmail.com>:
> What is the state of the $200 bounty on searching through PDFs?
> I thought I might give it a shot.
> 

sigrid's started working on pdffs, and noam
has also expressed interest.

But, don't let that stop you -- collaboration
is good, and I'd be happy to give multiple $200
bounties out if multiple people were needed to
get it over the finish line.



* Re: [9front] PDF search bounty
  2021-05-30 23:38 ` ori
@ 2021-05-31 18:12   ` binary cat
  0 siblings, 0 replies; 9+ messages in thread
From: binary cat @ 2021-05-31 18:12 UTC (permalink / raw)
  To: 9front

For extracting text from pdfs, I found
`/sys/lib/ghostscript/ps2ascii.gs` to be useful.
Currently I just have an awk+rc script wrapped around it, which
already does a decent job, using `plumb` to bring up the correct page.
I modified `page` to accept x/y coordinate arguments, but it doesn't
use the same scaling/origin as `ps2ascii.gs`, so that's giving me some
trouble.
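
A scaling/origin mismatch like this is usually a unit conversion plus
a flip of the y axis.  The Python sketch below shows the arithmetic
under stated assumptions: the extractor reports positions in
PostScript points (1/72 inch) with the origin at the bottom left, and
the viewer wants pixels with the origin at the top left.  The units
actually used by `ps2ascii.gs` and `page` are not stated in this
thread, so units_per_point, dpi and the function name are all
hypothetical.

```python
def points_to_pixels(x, y, page_height_pt, dpi=100.0, units_per_point=1.0):
    """Map an extractor coordinate to a top-left-origin pixel coordinate.

    Assumptions (not confirmed by the thread): x and y are measured from
    the bottom-left corner in units of 1/units_per_point points, and the
    viewer renders the page at `dpi` pixels per inch.
    """
    scale = dpi / 72.0                    # pixels per PostScript point
    x_pt = x / units_per_point            # normalize to points
    y_pt = y / units_per_point
    x_px = x_pt * scale
    y_px = (page_height_pt - y_pt) * scale  # flip the y axis
    return round(x_px), round(y_px)
```

For a US Letter page (792 points high) rendered at 144 dpi, a point
one inch in from the left and one inch down from the top, that is
(72, 720) in points, maps to pixel (144, 144).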


* Re: [9front] PDF search bounty
  2021-05-30 20:10 [9front] PDF search bounty binary cat
  2021-05-30 22:59 ` Stanley Lieber
  2021-05-30 23:38 ` ori
@ 2021-06-01 22:29 ` Noam Preil
  2 siblings, 0 replies; 9+ messages in thread
From: Noam Preil @ 2021-06-01 22:29 UTC (permalink / raw)
  To: binary cat

Hey,

Added basic heuristics-based text extraction to pdffs!

http://pixelhero.dev/tmp/heuristics5.png 
http://pixelhero.dev/tmp/search.png

Working on proper rendering soon :)

- Noam

