Computer Old Farts Forum
From: Dennis Boone <drb@msu.edu>
To: coff <coff@tuhs.org>
Subject: [COFF] Re: converting lousy scans of pdfs into something more useable
Date: Fri, 03 Feb 2023 11:00:10 -0500
Message-ID: <20230203160010.A78264BBB5@yagi.h-net.msu.edu>
In-Reply-To: (Your message of Fri, 03 Feb 2023 09:27:02 -0600.) <bafc6b83-d14b-1a68-5480-23ad75ef22f4@gmail.com>

 > I read a tremendous number of documents from the web, or at least
 > read parts of them - to the tune of maybe 50 or so a week. It is
 > appalling to me in this era that we can't get better at scanning. Be
 > that as it may, the needle doesn't seem to have moved appreciably in
 > the last decade or so and it's a little sad. Sure, if folks print to
 > pdf, it's great. But, if they scan a doc, not so great, even today.

I see a fair number of frustrating scanned-doc PDFs too.  My thoughts on
what constitutes a decent scan:

* Assume people will print at least a few pages occasionally.  It's
  often easier to print that one table or diagram and take it to the
  bench than to try to use a tablet or run back and forth to a PC.  That
  affects how you think about creating the PDF.

* Don't use lossy JBIG2 and similar compression algorithms that try to
  re-use blocks of pixels from elsewhere in the document -- too many
  errors, and they're errors of the sort that can be critical.  Even if
  the replacements use the correct code point, they're distracting as
  hell in a different font, size, etc.

* OCR-under is good.  I use `ocrmypdf`, which uses the Tesseract engine.

* I do get angry when I see people trying to reconstruct the document
  via OCR and omitting the actual scan -- too many errors.

* Bookmarks for pages / table of contents entries / etc. are mandatory.
  Very few things make a scanned-doc PDF less useful than not being able
  to skip directly to a document-indicated page.

* I like to see at least 300 dpi.

* Don't scan in color mode if the source material isn't color.  Grey
  scale or even "line art" works fine in most cases.  Using one bit per
  pixel means you can use G4 compression for colorless pages.

* Do reduce the color depth of pages that do contain color if you can.
  The resulting PDF can contain a mix of image types.  I've worked with
  documents that did use color where four or eight colors were enough,
  and the whole document could be mapped to them.  With care, you _can_
  force the scans down to two or three bits per pixel.

* Do insert sensible metadata.

* Do try to square up the inevitably crooked scans, clean up major
  floobydust and whatever crud around the edges isn't part of the paper,
  etc.  Besides making the result more readable, it'll help the OCR.  I
  never have any luck with automated page orientation tooling for some
  reason, so I end up just doing this with GIMP.
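
To put rough numbers on the color-depth points above, here is a minimal
Python sketch.  It assumes a US Letter page at 300 dpi and computes
uncompressed raster sizes only; real PDFs compress the images (G4, Flate,
etc.), so treat the figures as relative, not as final file sizes.  The
function names are made up for illustration.

```python
def bits_for_colors(n_colors: int) -> int:
    """Smallest whole number of bits per pixel that can index n_colors."""
    if n_colors < 1:
        raise ValueError("need at least one color")
    return max(1, (n_colors - 1).bit_length())


def raw_page_bytes(width_in: float, height_in: float, dpi: int,
                   bits_per_pixel: int) -> int:
    """Uncompressed size of one page image, rows padded to whole bytes."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    bytes_per_row = (width_px * bits_per_pixel + 7) // 8
    return bytes_per_row * height_px


if __name__ == "__main__":
    # Assumed page: US Letter (8.5 x 11 in) scanned at 300 dpi.
    modes = [
        ("24-bit color", 24),
        ("8-bit grey scale", 8),
        ("8 colors", bits_for_colors(8)),
        ("4 colors", bits_for_colors(4)),
        ("line art", 1),
    ]
    for label, bpp in modes:
        size = raw_page_bytes(8.5, 11, 300, bpp)
        print(f"{label:18s} {bpp:2d} bpp  {size:>12,d} bytes")
```

Before any compression, a line-art page is about 1/24 the size of the
same page scanned in full color, and a four-color page about 1/12.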

Tuppence.

De
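
A hedged illustration of the OCR-under and metadata points: the Python
sketch below only assembles an `ocrmypdf` command line and prints it.
The flags shown (`--language`, `--deskew`, `--title`, `--author`) are as
I understand them from the ocrmypdf documentation; check
`ocrmypdf --help` on your install.  The file names, title, and helper
function are hypothetical.

```python
import shlex


def build_ocrmypdf_command(src, dst, *, language="eng", title=None,
                           author=None, deskew=False):
    """Return an argv list for ocrmypdf; nothing is executed here."""
    cmd = ["ocrmypdf", "--language", language]
    if deskew:
        # Automated straightening; skip it if you square up pages by hand.
        cmd.append("--deskew")
    if title:
        cmd += ["--title", title]
    if author:
        cmd += ["--author", author]
    cmd += [src, dst]
    return cmd


if __name__ == "__main__":
    # Hypothetical input/output names and metadata.
    cmd = build_ocrmypdf_command(
        "scan.pdf", "scan-ocr.pdf",
        title="Service Manual", author="Unknown", deskew=True,
    )
    print(shlex.join(cmd))
    # To actually run it: subprocess.run(cmd, check=True)
```

Building the argument list separately from running it makes it easy to
eyeball the command first, and avoids shell quoting trouble with titles
that contain spaces.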
