From: Bakul Shah
Date: Fri, 3 Feb 2023 08:01:37 -0800
To: Will Senn
CC: coff
Subject: [COFF] Re: converting lousy scans of pdfs into something more useable
List-Id: Computer Old Farts Forum

On Feb 3, 2023, at 7:27 AM, Will Senn wrote:
>
> what's your experience with using sad pdfs?
> Do you just live with them as they are, or do you try to fix them
> and how, or do you use a workflow and get good results?

Usually I just live with them, but I may use "ocrmypdf" if search or
copy-paste is unsatisfactory.

https://github.com/ocrmypdf/OCRmyPDF

It's a Python script that runs on most any Unix and uses Tesseract.
Its author's motivation seems similar to yours:

  I searched the web for a free command line tool to OCR PDF files: I
  found many, but none of them were really satisfying:
  • Either they produced PDF files with misplaced text under the image
    (making copy/paste impossible)
  • Or they did not handle accents and multilingual characters
  • Or they changed the resolution of the embedded images
  • Or they generated ridiculously large PDF files
  • Or they crashed when trying to OCR
  • Or they did not produce valid PDF files
  • On top of that none of them produced PDF/A files (format dedicated
    for long time storage)
  ...so I decided to develop my own tool.

I rarely print PDFs any more.
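
A minimal sketch of scripting such a run from Python, assuming ocrmypdf
is installed and on PATH. The helper names (ocrmypdf_command,
add_text_layer), the file names, and the particular choice of options
are illustrative assumptions, not anything from the message above; the
flags used (-l, --output-type, --deskew, --skip-text) are ocrmypdf
command-line options.

```python
import shutil
import subprocess


def ocrmypdf_command(src, dst, language="eng", deskew=True, skip_text=True):
    """Build the argv list for one ocrmypdf run (nothing is executed here)."""
    # Hypothetical defaults: English OCR, PDF/A output for long-term storage.
    cmd = ["ocrmypdf", "-l", language, "--output-type", "pdfa"]
    if deskew:
        cmd.append("--deskew")     # straighten crooked page scans before OCR
    if skip_text:
        cmd.append("--skip-text")  # leave pages that already have text alone
    cmd += [str(src), str(dst)]
    return cmd


def add_text_layer(src, dst, **options):
    """Run ocrmypdf if it is on PATH; return True on success, else False."""
    if shutil.which("ocrmypdf") is None:
        return False  # tool not installed, nothing was run
    result = subprocess.run(ocrmypdf_command(src, dst, **options))
    return result.returncode == 0


if __name__ == "__main__":
    # Show the command that would be run for a hypothetical scan.pdf.
    print(" ".join(ocrmypdf_command("scan.pdf", "scan-ocr.pdf")))
```

Building the command separately from running it makes it easy to check
what would be executed before touching a whole directory of scans.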