From: Dennis Boone
To: coff
Subject: [COFF] Re: converting lousy scans of pdfs into something more useable
Date: Fri, 03 Feb 2023 11:00:10 -0500
In-reply-to: (Your message of Fri, 03 Feb 2023 09:27:02 -0600.)
Message-Id: <20230203160010.A78264BBB5@yagi.h-net.msu.edu>
List-Id: Computer Old Farts Forum

> I read a tremendous number of documents from the web, or at least
> read parts of them - to the tune of maybe 50 or so a week.  It is
> appalling to me in this era that we can't get better at scanning.
> Be that as it may, the needle doesn't seem to have moved appreciably
> in the last decade or so and it's a little sad.  Sure, if folks print
> to pdf, it's great.  But, if they scan a doc, not so great, even today.

I see a fair number of frustrating scanned-doc PDFs too.  My thoughts on
what constitutes a decent scan:

* Assume people will print at least a few pages occasionally.  It's
  often easier to print that one table or diagram and take it to the
  bench than to try to use a tablet or run back and forth to a PC.
  That affects how you think about creating the PDF.

* Don't use lossy JBIG2 and similar compression algorithms that try to
  re-use blocks of pixels from elsewhere in the document -- too many
  errors, and they're errors of the sort that can be critical.  Even if
  the replacements use the correct code point, they're distracting as
  hell in a different font, size, etc.

* OCR-under is good.  I use `ocrmypdf`, which uses the Tesseract
  engine.

* I do get angry when I see people trying to reconstruct the document
  via OCR and omitting the actual scan -- too many errors.

* Bookmarks for pages / table of contents entries / etc. are mandatory.
  Very few things make a scanned-doc PDF less useful than not being
  able to skip directly to a document-indicated page.

* I like to see at least 300 dpi.

* Don't scan in color mode if the source material isn't color.  Grey
  scale or even "line art" works fine in most cases.  Using one bit per
  pixel means you can use G4 compression for colorless pages.

* Do reduce the color depth of pages that do contain color if you can.
  The resulting PDF can contain a mix of image types.  I've worked with
  documents that did use color where four or eight colors were enough,
  and the whole document could be mapped to them.  With care, you _can_
  force the scans down to two or three bits per pixel.

* Do insert sensible metadata.

* Do try to square up the inevitably crooked scans, clean up major
  floobydust and whatever crud around the edges isn't part of the
  paper, etc.
  Besides making the result more readable, it'll help the OCR.  I never
  have any luck with automated page orientation tooling for some reason,
  so I end up just doing this with GIMP.

Tuppence.

De
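
To put rough numbers on the color-depth points above, here is a small
back-of-envelope sketch.  It only does arithmetic on uncompressed page
sizes, before G4 or any other compression.  The function names are
hypothetical, and the US Letter page size (8.5 x 11 in) is an assumption
for illustration; the 300 dpi figure and the four- and eight-color cases
come from the list above.

```python
import math


def bits_per_pixel(num_colors: int) -> int:
    """Smallest whole number of bits that can index num_colors palette entries."""
    if num_colors < 1:
        raise ValueError("need at least one color")
    # (n - 1).bit_length() is ceil(log2(n)) for n >= 2; clamp to 1 bit minimum.
    return max(1, (num_colors - 1).bit_length())


def raw_page_bytes(width_in: float, height_in: float, dpi: int, bpp: int) -> int:
    """Uncompressed size of one page image in bytes (no row padding assumed)."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return math.ceil(pixels * bpp / 8)


if __name__ == "__main__":
    # Assumed page: US Letter, 8.5 x 11 inches, scanned at 300 dpi.
    for label, colors in [("line art", 2), ("4 colors", 4),
                          ("8 colors", 8), ("256 greys", 256)]:
        bpp = bits_per_pixel(colors)
        size = raw_page_bytes(8.5, 11, 300, bpp)
        print(f"{label:10s} {bpp:2d} bpp  {size:>10,d} bytes")
    # 24-bit color is three 8-bit channels rather than a palette.
    print(f"{'24-bit':10s} 24 bpp  {raw_page_bytes(8.5, 11, 300, 24):>10,d} bytes")
```

Under those assumptions a page is 2550 x 3300 pixels: about 1.05 MB raw
at one bit per pixel against about 25.2 MB in 24-bit color.  A four-color
palette needs two bits per pixel and an eight-color palette three, which
is where the "two or three bits per pixel" figure comes from.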