[TUHS] tesseract has gotten so much better

The Unix Heritage Society mailing list

From: Will Senn <will.senn@gmail.com>
To: TUHS <tuhs@tuhs.org>
Subject: [TUHS] tesseract has gotten so much better
Date: Sat, 28 Jan 2023 09:05:38 -0600
Message-ID: <245759ab-c8de-f616-8942-8d7d193a960a@gmail.com>

Hi All,

I just wanted to let y'all know that tesseract ocr has significantly 
improved and is much easier to use than it used to be. I have been using 
it in my workflow for a bit and it's crazy how much better it is than 
it was back when I last tried it (admittedly 5-6 years ago). For those 
of you doing your own scans, or those of you finding sad little pdfs 
without ocr, the process is fairly simple.

Let's say you find "The Master Manual of Fortran.pdf" out there in the 
wild (or scan it). Here's how to turn it into a glorious ocr'd version:

Export your pdf as a multi-image tiff - it'll be ginormous, but you can 
delete it later (on Mac, this is just export from Preview and select 
tiff, but gs will do it too, if I remember correctly) and then:

tesseract The\ Master\ Manual\ of\ Fortran.tiff out -l eng pdf

et voila, a nice, if large, pdf called out.pdf will appear 
with ocr text that actually matches your scan (it seems to have caught 
up to adobe's ocr, or is quite close in my view, ymmv).
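
For the gs route, the whole round trip might look roughly like the sketch
below. The helper names (run, pdf_to_ocr_pdf), the tiff24nc device and the
300 dpi resolution are assumptions chosen for illustration, not part of the
workflow described above; setting DRY_RUN=1 only prints the commands.

```shell
#!/bin/sh
# Sketch: rasterize a PDF to a multi-page TIFF with Ghostscript, then OCR it.
# run and pdf_to_ocr_pdf are hypothetical helper names.

run() {
    # Print the command when DRY_RUN=1, otherwise execute it.
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

pdf_to_ocr_pdf() {
    in_pdf=$1                 # scanned PDF without a text layer
    out_base=$2               # tesseract appends .pdf to this name
    tiff="${out_base}.tiff"   # large intermediate file

    # tiff24nc (24-bit colour) and -r300 are assumed settings. All pages land
    # in one TIFF because the output name has no %d page counter.
    run gs -q -dNOPAUSE -dBATCH -sDEVICE=tiff24nc -r300 \
        -sOutputFile="$tiff" "$in_pdf" &&
    run tesseract "$tiff" "$out_base" -l eng pdf &&
    run rm -f "$tiff"
}
```

Calling pdf_to_ocr_pdf "The Master Manual of Fortran.pdf" out should leave
out.pdf behind and remove the intermediate tiff only if both steps succeed.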

I speak English, so I installed tesseract and tesseract-eng, but it 
supports a bunch of other languages if you need them. Apparently 
google's been supporting and developing it for a while now and if my 
results are any indicator, it's paying off (boy do I remember all the 
gobbledegook it used to produce).
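
If you need more than English, tesseract --list-langs shows which language
data is installed, and -l accepts several codes joined with "+". A small
sketch, where join_langs is a hypothetical helper and scan.tiff is a
placeholder file name:

```shell
#!/bin/sh
# Sketch: build the value for tesseract's -l option from separate codes.
# join_langs is a hypothetical helper name.

join_langs() {
    # "$*" joins the arguments with the first character of IFS;
    # the subshell keeps the IFS change local.
    ( IFS=+; printf '%s\n' "$*" )
}

# Usage (needs tesseract and the matching language data installed):
#   tesseract --list-langs
#   tesseract scan.tiff out -l "$(join_langs eng deu)" pdf
```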

tesseract will import from different image types, multiple images, etc. 
I just like the simplicity of tiff->pdf.
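
For the multiple-images case, tesseract also takes a text file listing one
image path per line and combines the results into a single output. A rough
sketch, assuming zero-padded page names such as page-001.png;
make_page_list is a hypothetical helper:

```shell
#!/bin/sh
# Sketch: collect page images into a list file that tesseract can read.
# make_page_list is a hypothetical helper name; the .png naming is assumed.

make_page_list() {
    dir=$1    # directory holding the page images
    list=$2   # text file to write, one image path per line

    : > "$list"
    # Glob expansion is sorted, so zero-padded names stay in page order.
    for f in "$dir"/*.png; do
        if [ -e "$f" ]; then
            printf '%s\n' "$f" >> "$list"
        fi
    done
}

# Usage (needs tesseract installed):
#   make_page_list scans pages.txt
#   tesseract pages.txt out -l eng pdf
```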

Anyhow, thought y'all might like to know as many of you live off the 
scans :).

Will
