9fans - fans of the OS Plan 9 from Bell Labs
* [9fans] text search in PDF?
@ 2004-05-14  6:03 cej
  2004-05-14 10:05 ` a
  0 siblings, 1 reply; 8+ messages in thread
From: cej @ 2004-05-14  6:03 UTC (permalink / raw)
  To: 9fans

Hi, folks!

Can I look for a string while browsing a PS/PDF (in page)? If not, would it be difficult to implement?
Does no one else miss this feature?

Thanks, regards,
Peter.


Peter A. Cejchan
Lab of Paleobiology and Paleoecology
Institute of Geology
Academy of Sciences
Rozvojova 135
16502 Prague, Czechia
http://www.gli.cas.cz/home/cejchan





* Re: [9fans] text search in PDF?
  2004-05-14  6:03 [9fans] text search in PDF? cej
@ 2004-05-14 10:05 ` a
  2004-05-14 12:48   ` boyd, rounin
  0 siblings, 1 reply; 8+ messages in thread
From: a @ 2004-05-14 10:05 UTC (permalink / raw)
  To: 9fans

// Can I look for a string while browsing a PS/PDF

some PS/PDF (i've confirmed what i'm about to say in one, but
not both, and can't remember which) files include the actual
text of their content; others only treat the letters as more
graphics. searching for text is possible in the first but not
the latter (without implementing OCR). i cannot speak to how
difficult it would be to implement said search.

i, too, miss this feature. 
ア



* Re: [9fans] text search in PDF?
  2004-05-14 10:05 ` a
@ 2004-05-14 12:48   ` boyd, rounin
  2004-05-14 17:17     ` Lyndon Nerenberg
  0 siblings, 1 reply; 8+ messages in thread
From: boyd, rounin @ 2004-05-14 12:48 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

> // Can I look for a string while browsing a PS/PDF
>
> some PS/PDF (i've confirmed what i'm about to say in one, but
> not both, and can't remember which) files include the actual
> text of their content;

iirc it's PS, but it's a nasty problem anyway.




* Re: [9fans] text search in PDF?
  2004-05-14 12:48   ` boyd, rounin
@ 2004-05-14 17:17     ` Lyndon Nerenberg
  0 siblings, 0 replies; 8+ messages in thread
From: Lyndon Nerenberg @ 2004-05-14 17:17 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

--On 2004-5-14 2:48 PM +0200 "boyd, rounin" <boyd@insultant.net> wrote:

>> // Can I look for a string while browsing a PS/PDF
[...]
> iirc it's PS, but it's a nasty problem anyway.

Maybe take a look at how xpdf supports string searches?

--lyndon



* RE: [9fans] text search in PDF?
  2004-05-14 14:22 ` dvd
  2004-05-14 15:47   ` splite
@ 2004-05-14 16:45   ` a
  1 sibling, 0 replies; 8+ messages in thread
From: a @ 2004-05-14 16:45 UTC (permalink / raw)
  To: 9fans


// Rare cases when rasterized images of text pages are wrapped
// into Postscript or PDF formats are mostly due to a need to
// somehow publish scanned documents; or due to a faulty conversion
// toolchain.

i can't comment on the causes, although your assessment seems quite
reasonable. i can say, however, that when i looked at the issue a
year or two (not more) ago, these cases were certainly not rare.
my measurements were quite unscientific, but it looked to be about
50% of the documents i examined (mostly pulled indiscriminately from
the web).
α



* Re: [9fans] text search in PDF?
  2004-05-14 14:22 ` dvd
@ 2004-05-14 15:47   ` splite
  2004-05-14 16:45   ` a
  1 sibling, 0 replies; 8+ messages in thread
From: splite @ 2004-05-14 15:47 UTC (permalink / raw)
  To: 9fans

On Fri, May 14, 2004 at 07:22:06PM +0500, dvd@davidashen.net wrote:
>
> The problem is high quality formatting -- most documents which are
> kerned are kerned explicitly -- words are broken into parts, and
> displacements are set for word parts, which makes searching for whole
> words impossible.

Funny how things come full-circle.  We used to need preview images
attached to Encapsulated PostScript files because on-screen rendering
was too slow.  Now it seems we need "plain-text thumbnails" embedded
in PDF files to facilitate searches, braille or spoken output, etc.

Maybe PDF already has that capability but nobody uses it.  (Almost
nobody used the EPSI format either.)



* RE: [9fans] text search in PDF?
  2004-05-14 13:04 Trickey, Howard W (Howard)
@ 2004-05-14 14:22 ` dvd
  2004-05-14 15:47   ` splite
  2004-05-14 16:45   ` a
  0 siblings, 2 replies; 8+ messages in thread
From: dvd @ 2004-05-14 14:22 UTC (permalink / raw)
  To: 9fans

> it's not strictly PS, though they share the imaging model and "The PDF
> operators for setting the graphics state and painting graphics
> objects are similar to the corresponding operators in the
> PostScript language".  Close enough to PS for text searching.
> The main annoyance is that PDF files usually compress their
> content after a short bit of PS-like prolog.

Both Postscript and PDF use text strings to represent text.
Rare cases when rasterized images of text pages are wrapped
into Postscript or PDF formats are mostly due to a need to
somehow publish scanned documents; or due to a faulty conversion
toolchain.

Compression is not a problem. To render a PDF, one has to uncompress
its contents anyway. And decompression is trivial.

The problem is high quality formatting -- most documents which are
kerned are kerned explicitly -- words are broken into parts, and displacements
are set for word parts, which makes searching for whole words impossible.

David Tolpin
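
A minimal sketch of the kerning problem described above, in Python.
The content-stream bytes are a made-up example, not taken from a real
document, and the helper name tj_text is hypothetical. It assumes plain
literal strings inside a TJ array, with no escaped parentheses, hex
strings, or custom font encodings, all of which a real searcher would
have to handle.

```python
import re

# Hypothetical page content: "Hello World" set with explicit kerning, so
# "World" is split into the fragments "W" and "orld" with a displacement
# (the number 80) between them.
content = b"BT /F1 12 Tf [(Hello ) -20 (W) 80 (orld)] TJ ET"

def tj_text(stream: bytes) -> str:
    """Join the string fragments of each [...] TJ array, dropping the numbers."""
    lines = []
    for array in re.findall(rb"\[(.*?)\]\s*TJ", stream, re.S):
        fragments = re.findall(rb"\((.*?)\)", array, re.S)
        lines.append(b"".join(fragments).decode("latin-1"))
    return " ".join(lines)

# A byte-level search for the whole word fails, because the word never
# appears contiguously in the stream.
naive_hit = b"World" in content

# Searching the joined fragments succeeds.
text = tj_text(content)
joined_hit = "World" in text

print(naive_hit, joined_hit)  # False True
print(text)                   # Hello World
```

Joining fragments this way recovers words split by kerning, but not
spaces that a document encodes only as large displacements; those would
need the displacement values and font metrics.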




* RE: [9fans] text search in PDF?
@ 2004-05-14 13:04 Trickey, Howard W (Howard)
  2004-05-14 14:22 ` dvd
  0 siblings, 1 reply; 8+ messages in thread
From: Trickey, Howard W (Howard) @ 2004-05-14 13:04 UTC (permalink / raw)
  To: 'Fans of the OS Plan 9 from Bell Labs'

>> some PS/PDF (i've confirmed what i'm about to say in one, but
>> not both, and can't remember which) files include the actual
>> text of their content;

> iirc it's PS, but it's a nasty problem anyway.

it's not strictly PS, though they share the imaging model and "The PDF
operators for setting the graphics state and painting graphics
objects are similar to the corresponding operators in the
PostScript language".  Close enough to PS for text searching.
The main annoyance is that PDF files usually compress their
content after a short bit of PS-like prolog.
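
As a rough illustration of getting past the compression, here is a
Python sketch that locates stream bodies and inflates them with zlib.
The sample object is hand-built for the example, and inflate_streams is
a hypothetical helper. It assumes a single /FlateDecode filter and
ignores /Length, other filters, encryption, and object streams.

```python
import re
import zlib

def inflate_streams(pdf: bytes) -> list:
    """Return the inflated body of every stream that zlib accepts."""
    bodies = []
    for m in re.finditer(rb"stream\r?\n(.*?)endstream", pdf, re.S):
        try:
            # decompressobj tolerates the end-of-line bytes that sit
            # between the compressed data and the "endstream" keyword.
            bodies.append(zlib.decompressobj().decompress(m.group(1)))
        except zlib.error:
            pass  # not Flate data (or not compressed at all); skip it
    return bodies

# Hand-built stand-in for one compressed page object (an assumption for
# this example, not a complete or valid PDF file).
page = b"BT /F1 12 Tf (text search in PDF) Tj ET"
packed = zlib.compress(page)
pdf = (b"%PDF-1.4\n4 0 obj\n<< /Length " + str(len(packed)).encode()
       + b" /Filter /FlateDecode >>\nstream\n" + packed
       + b"\nendstream\nendobj\n")

streams = inflate_streams(pdf)
print(streams[0] == page)  # True
```

Once inflated, the content reads much like PostScript, which is where
a text search can start.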



