From: Will Senn
Date: Fri, 3 Feb 2023 10:21:31 -0600
To: coff@tuhs.org
Subject: [COFF] Re: converting lousy scans of pdfs into something more, useable
In-Reply-To: <167544017712.2485736.11108085155717490044@minnie.tuhs.org>
List-Id: Computer Old Farts Forum

> From: Dennis Boone
>
> * Don't use JPEG 2000 and similar compression algorithms that try to
> re-use blocks of pixels from elsewhere in the document -- too many
> errors, and they're errors of the sort that can be critical. Even if
> the replacements use the correct code point, they're distracting as
> hell in a different font, size, etc.

I wondered why certain images were the way they were. This probably
explains a lot.

> * OCR-under is good. I use `ocrmypdf`, which uses the Tesseract engine.

Thanks for the tips.

> * Bookmarks for pages / table of contents entries / etc are mandatory.
> Very few things make a scanned-doc PDF less useful than not being able
> to skip directly to a document indicated page.

I wish. This is a tough one. I generally sacrifice the bookmarks to make
a better PDF. I need to look into extracting bookmarks and whether they
can be re-added without getting all wonky.

> * I like to see at least 300 dpi.

Yes, me too, but I've found that this often results in files that are
too big when I'm fixing existing PDFs. When I'm creating them, they're
fine.

> * Don't scan in color mode if the source material isn't color. Grey
> scale or even "line art" works fine in most cases. Using one pixel
> means you can use G4 compression for colorless pages.

Amen :).

>
> * Do reduce the color depth of pages that do contain color if you can.
> The resulting PDF can contain a mix of image types. I've worked with
> documents that did use color where four or eight colors were enough,
> and the whole document could be mapped to them. With care, you _can_
> force the scans down to two or three bits per pixel.
> * Do insert sensible metadata.
>
> * Do try to square up the inevitably crooked scans, clean up major
> floobydust and whatever crud around the edges isn't part of the paper,
> etc. Besides making the result more readable, it'll help the OCR. I
> never have any luck with automated page orientation tooling for some
> reason, so end up just doing this with Gimp.

Great points. Thanks.

-will
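
P.S. A rough back-of-the-envelope sketch of why the dpi and color-depth
advice matters for file size. It assumes a US Letter page (8.5 x 11 in),
which nobody in the thread specified, and it only computes raw,
uncompressed sizes; real PDFs will be smaller once G4, JPEG, or similar
compression is applied. The function name `raw_page_bytes` is made up for
illustration.

```python
import math


def raw_page_bytes(dpi, bits_per_pixel, width_in=8.5, height_in=11.0):
    """Uncompressed size in bytes of one scanned page.

    width_in and height_in default to US Letter, an assumption made
    for this sketch only.
    """
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    # Pad each row to a whole number of bytes, as most raster formats do.
    row_bytes = math.ceil(width_px * bits_per_pixel / 8)
    return row_bytes * height_px


if __name__ == "__main__":
    depths = [
        ("line art (1 bit)", 1),
        ("3-bit palette", 3),
        ("8-bit grey", 8),
        ("24-bit color", 24),
    ]
    for label, bpp in depths:
        size_mb = raw_page_bytes(300, bpp) / 1_000_000
        print(f"{label:18s} {size_mb:6.2f} MB per page at 300 dpi")
```

Under those assumptions, a 300 dpi page is roughly 25 MB raw in 24-bit
color versus roughly 1 MB as 1-bit line art, before any compression, which
is the gap the "don't scan in color" and "reduce the color depth" points
are getting at.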