[COFF] converting lousy scans of pdfs into something more useable

Computer Old Farts Forum

* [COFF] converting lousy scans of pdfs into something more useable
@ 2023-02-03 15:27 Will Senn
  2023-02-03 16:00 ` [COFF] " Dennis Boone
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Will Senn @ 2023-02-03 15:27 UTC (permalink / raw)
  To: coff


All,

I thought I would post something here that wasn't DOA over on tuhs and 
see if it would fly here instead. I have been treating coff as the
place where off-topic tuhs posts go to die, but after the latest thread
bemoaning the lack of a place to go for topics tangential to unix, I
thought I'd actually start a coff thread! Here goes...

I read a tremendous number of documents from the web, or at least read 
parts of them - to the tune of maybe 50 or so a week. It is appalling to 
me in this era that we can't get better at scanning. Be that as it may, 
the needle doesn't seem to have moved appreciably in the last decade or 
so and it's a little sad. Sure, if folks print to pdf, it's great. But, 
if they scan a doc, not so great, even today.

Rather than worry about the scanning aspects, I am more interested in 
what to do with those scans. Can they be handled in such a way as to 
give them new life? Unlike the scanning side of things, I have found 
quite a bit of movement in the area of being able to work with the pdfs 
and I'd really like to get way better at it. If I get a badly scanned
pdf and can make it legible on screen, legible in print, and searchable,
I'm golden. Sadly, that's way harder than it sounds, or, in my opinion,
than it should be.

I recently put together a workflow that is tenable, if time consuming. 
If you're interested in the details, I've shared them:

https://decuser.github.io/pdfs/2023/02/01/pdf-cleanup-workflow.html

In the note, I leverage a lot of great tools that have significantly 
improved over the years to the point where they do a great job at what 
they do. But, there's lots of room for improvement. Particularly in the 
area of image tweaking around color and highlights and such.

The note is mac-centric in that I use a mac; otherwise, all of the tools
work on modern *nix and, with a little abstract thought, windows too.

In my world, here's what happens:

* find a really interesting topic and along the way, collect pdfs to read
* open the pdf and find it salient, but not so readable, with sad 
printability, and no or broken OCR
* I begin the process of making the pdf better with the aforementioned 
goals aforethought

The process in a nutshell:

1. Extract the images to individual tiffs (so many tools can't work with 
multi-image tiffs)

     * pdfimages from poppler works great for this

2. Adjust the color (it seems impossible to do this without a batch 
capable gui app)

     * I use Photoscape X for this - just click batch and make 
adjustments to all of the images using the same settings

3. Resize the images - most pdfs have super wonky sizes

     * I use convert from imagemagick for this and I compress the tiffs 
while I'm converting them

4. Recombine the images into a multi-tiff image

     * I use tiffcp from libtiff for this

5. OCR the reworked image set

     * I use tesseract for this - It's gotten so much better it's ridiculous

This process results in a pdf that meets the objectives.
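
For reference, a minimal sketch of steps 1 and 3 through 5 as shell
functions. The function names, the `work/page` prefix, the LZW
compression choice and the `eng` language are placeholders, not the
exact settings from the note. Setting DRY_RUN=1 prints each command
instead of running it.

```shell
# Sketch only: assumes poppler, ImageMagick, libtiff and tesseract are
# installed. All function names here are hypothetical.

# Print the command instead of running it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# 1. Extract every page image as a single-page TIFF (PREFIX-000.tif, ...).
extract_pages() { run pdfimages -tiff "$1" "$2/page"; }

# 2. Color adjustment goes here (the manual gui step described above).

# 3. Resize one page to fit within WxH pixels and compress it.
resize_page() { run convert "$1" -resize "$3" -compress LZW "$2"; }

# 4. Recombine single pages into one multi-page TIFF (output name first).
combine_pages() { out=$1; shift; run tiffcp "$@" "$out"; }

# 5. OCR the multi-page TIFF; tesseract writes BASE.pdf.
ocr_to_pdf() { run tesseract "$1" "$2" -l eng pdf; }
```

With DRY_RUN unset, `extract_pages in.pdf work` followed by a loop over
`work/page-*.tif` would run the real tools.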

It's not horribly difficult to do and it's not horribly time consuming. 
It represents many, many attempts to figure out this thorny problem.

I'd really like to get away from needing Photoscape X, though. Then I 
could entirely automate the workflow in bash...

The problem is that the image adjustments are the most critical step and
the only one still tied to a gui - image extraction, resize, compression,
recombining images, ocr (I still can't believe it), and outputting a pdf
are now taken care of by command line tools that work well.

I wouldn't mind using a gui to figure out some color setting (Grayscale,
Black and White, or Color) and increase/decrease values for shadows and
highlights if those could then be mapped to command line arguments of a
tool that could apply them, though. Then the workflow could be: extract
a good representative page as an image, open it, figure out the color
settings, and then use those settings with toolY as part of the scripted
workflow.
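
One possible shape for such a toolY, sketched with ImageMagick. This is
an assumption, not an equivalent of the Photoscape X sliders: `-level`
sets a black point and a white point, which only approximates
shadow/highlight adjustments, and the mode names and `adjust_page`
function are made up for illustration.

```shell
# Print the command instead of running it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# adjust_page MODE BLACK WHITE IN OUT   (hypothetical helper)
#   MODE  : gray | bw | color
#   BLACK : black point, e.g. 10% (raise it to deepen shadows)
#   WHITE : white point, e.g. 90% (lower it to brighten highlights)
adjust_page() {
    mode=$1 black=$2 white=$3 in=$4 out=$5
    case $mode in
        gray)  run convert "$in" -colorspace Gray -level "$black,$white" "$out" ;;
        bw)    run convert "$in" -colorspace Gray -level "$black,$white" -threshold 50% "$out" ;;
        color) run convert "$in" -level "$black,$white" "$out" ;;
        *)     echo "unknown mode: $mode" >&2; return 1 ;;
    esac
}
```

The values found by eye on one representative page could then be applied
to every page, e.g. `adjust_page gray 10% 90% page-000.tif out-000.tif`
inside a loop.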

Here are the objectives for easy reference:

1. The PDF needs to be readable on a decent monitor (zooming in doesn't 
distort the readability, pixelation that is systematic is ok, but not 
preferred). Yes, I know it's got a degree of subjectivity, but blobby, 
bleeding text is out of scope!

2. The PDF needs to print with a minimum of artifact (weird shadows, 
bleeding and blob are out). It needs to be easy to read.

3. The PDF needs to be searchable with good accuracy (generally, bad 
scans have no ocr, or ocr that doesn't work).

Size is a consideration, but depends greatly on the value of the work.
My own calculus goes like this - if it's modern work, it should be way
under 30 MB. If it's print to pdf, it should be way under 10 MB (remember
when you thought you'd never use 10 MB of space... for all of your files
and the os). If it is significant and rare, less than 150 MB can work.
Obviously, this is totally subjective; your calculus is probably quite
different.

The reason this isn't posted over in pdf.scans.discussion is that even 
if there were such a place, it'd be filled with super technical 
gibberish about color depth and the perils of gamma radiation or 
somesuch. We, as folks interested in preserving the past have a more 
pragmatic need for a workable solution that is attainable to mortals.

So, with that as a bit of background, let me ask what I asked previously
in a different way on tuhs, here in coff - what's your experience with
using sad pdfs? Do you just live with them as they are, or do you try to 
fix them and how, or do you use a workflow and get good results?

Later,

Will


* [COFF] Re: converting lousy scans of pdfs into something more useable
  2023-02-03 15:27 [COFF] converting lousy scans of pdfs into something more useable Will Senn
@ 2023-02-03 16:00 ` Dennis Boone
  2023-02-03 16:01 ` Bakul Shah
  2023-02-04  7:59 ` Ralph Corderoy
  2 siblings, 0 replies; 7+ messages in thread
From: Dennis Boone @ 2023-02-03 16:00 UTC (permalink / raw)
  To: coff

 > I read a tremendous number of documents from the web, or at least
 > read parts of them - to the tune of maybe 50 or so a week. It is
 > appalling to me in this era that we can't get better at scanning. Be
 > that as it may, the needle doesn't seem to have moved appreciably in
 > the last decade or so and it's a little sad. Sure, if folks print to
 > pdf, it's great. But, if they scan a doc, not so great, even today.

I see a fair number of frustrating scanned-doc PDFs too.  My thoughts on
what constitutes a decent scan:

* Assume people will print at least a few pages occasionally.  It's
  often easier to print that one table or diagram and take it to the
  bench than to try to use a tablet or run back and forth to a PC.  That
  affects how you think about creating the PDF.

* Don't use JPEG 2000 and similar compression algorithms that try to
  re-use blocks of pixels from elsewhere in the document -- too many
  errors, and they're errors of the sort that can be critical.  Even if
  the replacements use the correct code point, they're distracting as
  hell in a different font, size, etc.

* OCR-under is good.  I use `ocrmypdf`, which uses the Tesseract engine.

* I do get angry when I see people trying to reconstruct the document
  via OCR and omitting the actual scan -- too many errors.

* Bookmarks for pages / table of contents entries / etc are mandatory.
  Very few things make a scanned-doc PDF less useful than not being able
  to skip directly to a document indicated page.

* I like to see at least 300 dpi.

* Don't scan in color mode if the source material isn't color.  Grey
  scale or even "line art" works fine in most cases.  Using one bit per
  pixel means you can use G4 compression for colorless pages.

* Do reduce the color depth of pages that do contain color if you can.
  The resulting PDF can contain a mix of image types.  I've worked with
  documents that did use color where four or eight colors were enough,
  and the whole document could be mapped to them.  With care, you _can_
  force the scans down to two or three bits per pixel.

* Do insert sensible metadata.

* Do try to square up the inevitably crooked scans, clean up major
  floobydust and whatever crud around the edges isn't part of the paper,
  etc.  Besides making the result more readable, it'll help the OCR.  I
  never have any luck with automated page orientation tooling for some
  reason, so end up just doing this with Gimp.
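
A few of the points above, sketched as ImageMagick one-liners. The
function names are hypothetical, the 50% threshold and the color count
are starting points to tune per document, and automated deskew may well
fail on the same pages that defeat other orientation tooling. DRY_RUN=1
prints the commands instead of running them.

```shell
# Print the command instead of running it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Colorless page -> 1 bit per pixel, CCITT Group 4 (G4) compressed TIFF.
to_bilevel_g4() {
    run convert "$1" -colorspace Gray -threshold 50% -type bilevel -compress Group4 "$2"
}

# Color page -> at most N colors (e.g. 4 or 8), with dithering turned off.
reduce_colors() { run convert "$1" +dither -colors "$3" "$2"; }

# Straighten a crooked scan; 40% is the threshold the ImageMagick
# documentation suggests as a default.
deskew_page() { run convert "$1" -deskew 40% "$2"; }
```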

Tuppence.

De


* [COFF] Re: converting lousy scans of pdfs into something more useable
  2023-02-03 15:27 [COFF] converting lousy scans of pdfs into something more useable Will Senn
  2023-02-03 16:00 ` [COFF] " Dennis Boone
@ 2023-02-03 16:01 ` Bakul Shah
  2023-02-03 16:25   ` Will Senn
  2023-02-04  7:59 ` Ralph Corderoy
  2 siblings, 1 reply; 7+ messages in thread
From: Bakul Shah @ 2023-02-03 16:01 UTC (permalink / raw)
  To: Will Senn; +Cc: coff

On Feb 3, 2023, at 7:27 AM, Will Senn <will.senn@gmail.com> wrote:
> 
> what's your experience with using sad pdfs? Do you just live with them as they are, or do you try to fix them and how, or do you use a workflow and get good results?

Usually I just live with them but I may use "ocrmypdf" if search
or copy-paste is unsatisfactory.

https://github.com/ocrmypdf/OCRmyPDF

It's a python script that runs on most any unix and uses
tesseract. Its author's motivation seems similar to yours:

I searched the web for a free command line tool to OCR PDF files: I found many, but none of them were really satisfying:
    • Either they produced PDF files with misplaced text under the image (making copy/paste impossible)
    • Or they did not handle accents and multilingual characters
    • Or they changed the resolution of the embedded images
    • Or they generated ridiculously large PDF files
    • Or they crashed when trying to OCR
    • Or they did not produce valid PDF files
    • On top of that none of them produced PDF/A files (format dedicated for long time storage)
...so I decided to develop my own tool.
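
A minimal sketch of how it might be invoked, depending on how bad the
existing text layer is. The `ocr_pdf` wrapper and its mode names are
made up for illustration; the three flags are ocrmypdf options, and
`-l eng` assumes English text. DRY_RUN=1 prints the command instead of
running it.

```shell
# Print the command instead of running it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# ocr_pdf MODE IN OUT   (hypothetical wrapper)
#   skip  : leave pages that already have text alone
#   redo  : replace an existing, poor OCR layer
#   force : rasterize every page and OCR from scratch
ocr_pdf() {
    case $1 in
        skip)  flag=--skip-text ;;
        redo)  flag=--redo-ocr ;;
        force) flag=--force-ocr ;;
        *)     echo "unknown mode: $1" >&2; return 1 ;;
    esac
    run ocrmypdf "$flag" -l eng "$2" "$3"
}
```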

I rarely print PDFs any more.


* [COFF] Re: converting lousy scans of pdfs into something more useable
  2023-02-03 16:01 ` Bakul Shah
@ 2023-02-03 16:25   ` Will Senn
  0 siblings, 0 replies; 7+ messages in thread
From: Will Senn @ 2023-02-03 16:25 UTC (permalink / raw)
  To: Bakul Shah; +Cc: coff


On 2/3/23 10:01 AM, Bakul Shah wrote:
>
> https://github.com/ocrmypdf/OCRmyPDF
>
> It's a python script that runs on most any unix and uses
> tesseract. Its author's motivation seems similar to yours:
>
> I searched the web for a free command line tool to OCR PDF files: I found many, but none of them were really satisfying:
>      • Either they produced PDF files with misplaced text under the image (making copy/paste impossible)
>      • Or they did not handle accents and multilingual characters
>      • Or they changed the resolution of the embedded images
>      • Or they generated ridiculously large PDF files
>      • Or they crashed when trying to OCR
>      • Or they did not produce valid PDF files
>      • On top of that none of them produced PDF/A files (format dedicated for long time storage)
> ...so I decided to develop my own tool.

Nice. Off to checking out OCRmyPDF!

> I rarely print PDFs any more.

I can't seem to get away from having to highlight and mark up the stuff 
I read. I love pdf's searchability of words, but not for quickly 
locating a section, or just browsing and studying them. I can flip pages 
much faster with paper than an ebook it seems :).

-will


* [COFF] Re: converting lousy scans of pdfs into something more useable
  2023-02-03 15:27 [COFF] converting lousy scans of pdfs into something more useable Will Senn
  2023-02-03 16:00 ` [COFF] " Dennis Boone
  2023-02-03 16:01 ` Bakul Shah
@ 2023-02-04  7:59 ` Ralph Corderoy
  2 siblings, 0 replies; 7+ messages in thread
From: Ralph Corderoy @ 2023-02-04  7:59 UTC (permalink / raw)
  To: coff

> https://decuser.github.io/pdfs/2023/02/01/pdf-cleanup-workflow.html

units(1) can do the sum for you.

    $ units -1 475point in
	    * 6.5972222

Be aware there are several kinds of ‘point’; search
/usr/share/units/definitions.units for /^point to get to the relevant
area.

It will also do your simple ‘echo ... | bc -l’ sums.

    $ units 2560/96
            Definition: 26.666667
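
Where units(1) isn't installed, the same sums can be sketched with
awk(1). This assumes the 72-per-inch PostScript point, which matches the
figure above (475/72); the function names are made up.

```shell
# pt_to_px POINTS DPI -> pixels, assuming 72 points per inch.
pt_to_px() {
    awk -v pt="$1" -v dpi="$2" 'BEGIN { printf "%.0f\n", pt / 72 * dpi }'
}

# in_to_px INCHES DPI -> pixels.
in_to_px() {
    awk -v inches="$1" -v dpi="$2" 'BEGIN { printf "%.0f\n", inches * dpi }'
}
```

For example, `pt_to_px 475 300` prints 1979, the pixel width of a
475-point-wide page at 300 dpi.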

GraphicsMagick, gm(1), may be more consistent and future-proof than
ImageMagick.

> I'd really like to get away from needing Photoscape X, though.
> Then I could entirely automate the workflow in bash...

gmic(1) is a very powerful and little known image-processing program.
It may help dump PhotoScape X.

-- 
Cheers, Ralph.


* [COFF] Re: converting lousy scans of pdfs into something more useable
@ 2023-02-03 17:09 Bakul Shah
  0 siblings, 0 replies; 7+ messages in thread
From: Bakul Shah @ 2023-02-03 17:09 UTC (permalink / raw)
  To: Will Senn; +Cc: coff

On Feb 3, 2023, at 8:26 AM, Will Senn <will.senn@gmail.com> wrote:
> 
> I can't seem to get away from having to highlight and mark up the stuff I read. I love pdf's searchability of words, but not for quickly locating a section, or just browsing and studying them. I can flip pages much faster with paper than an ebook it seems :).

You can annotate, highlight and mark up pdfs. There are apps for that though
I’m not very familiar with them as I don’t mark up even paper copies. On an
iPad you can easily annotate pdfs with an Apple Pencil.


* [COFF] Re: converting lousy scans of pdfs into something more useable
       [not found] <167544017712.2485736.11108085155717490044@minnie.tuhs.org>
@ 2023-02-03 16:21 ` Will Senn
  0 siblings, 0 replies; 7+ messages in thread
From: Will Senn @ 2023-02-03 16:21 UTC (permalink / raw)
  To: coff


> From: Dennis Boone <drb@msu.edu>
>
> * Don't use JPEG 2000 and similar compression algorithms that try to
>    re-use blocks of pixels from elsewhere in the document -- too many
>    errors, and they're errors of the sort that can be critical.  Even if
>    the replacements use the correct code point, they're distracting as
>    hell in a different font, size, etc.
I wondered why certain images were the way they were; this
probably explains a lot.

> * OCR-under is good.  I use `ocrmypdf`, which uses the Tesseract engine.
Thanks for the tips.
> * Bookmarks for pages / table of contents entries / etc are mandatory.
>    Very few things make a scanned-doc PDF less useful than not being able
>    to skip directly to a document indicated page.
I wish. This is a tough one. I generally sacrifice the bookmarks to
make a better pdf. I need to look into extracting bookmarks and whether
they can be re-added without getting all wonky.
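
One avenue to try, sketched with pdftk. This is untested against the
workflow and rests on two assumptions: that the installed pdftk handles
bookmarks in `update_info` (its man page describes it as changing
bookmarks and metadata), and that the rebuilt pdf keeps the same page
count and order so the bookmark page numbers still line up. The function
names are made up; DRY_RUN=1 prints the commands instead of running them.

```shell
# Print the command instead of running it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Dump metadata and bookmarks of the original pdf to a text file.
save_bookmarks() { run pdftk "$1" dump_data output "$2"; }

# Apply a previously dumped data file to the rebuilt pdf.
restore_bookmarks() { run pdftk "$1" update_info "$2" output "$3"; }
```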

> * I like to see at least 300 dpi.
Yes, me too, but I've found that this often results in files that are
too big when fixing existing pdfs; if I'm creating them, they're fine.

> * Don't scan in color mode if the source material isn't color.  Grey
>    scale or even "line art" works fine in most cases.  Using one bit per
>    pixel means you can use G4 compression for colorless pages.

Amen :).
>
> * Do reduce the color depth of pages that do contain color if you can.
>    The resulting PDF can contain a mix of image types.  I've worked with
>    documents that did use color where four or eight colors were enough,
>    and the whole document could be mapped to them.  With care, you _can_
>    force the scans down to two or three bits per pixel.
> * Do insert sensible metadata.
>
> * Do try to square up the inevitably crooked scans, clean up major
>    floobydust and whatever crud around the edges isn't part of the paper,
>    etc.  Besides making the result more readable, it'll help the OCR.  I
>    never have any luck with automated page orientation tooling for some
>    reason, so end up just doing this with Gimp.
Great points. Thanks.

-will


