public inbox archive for pandoc-discuss@googlegroups.com
From: BP Jonsson <bpjonsson-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org
Subject: Re: Is Pandoc OCR Capable
Date: Tue, 3 Jul 2018 23:27:58 +0200
Message-ID: <CAFC_yuTL346rr=kCV_u+JJk6T=fG+ZmSWntmTftBZO-29no1jw@mail.gmail.com>
In-Reply-To: <cf83dc69-7e1c-44e2-92d0-003de00c4a96-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>

I convert PDF to DOCX with Google Docs and then convert the DOCX to
Markdown with Pandoc. That saves a lot of work compared to converting PDF
to non-Markdown text and then adding Markdown markup manually.
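
A minimal sketch of the second step (DOCX to Markdown), assuming the DOCX has
already been downloaded from Google Docs and that `pandoc` is on the PATH. The
function names, file names, and the choice of `--wrap=none` and
`--extract-media` are illustrative assumptions, not something stated in this
thread:

```python
import subprocess
from pathlib import Path


def build_pandoc_command(docx_path, md_path, media_dir=None):
    """Return the argv list for a DOCX -> Markdown pandoc run.

    Hypothetical helper: it only assembles the command, so it can be
    inspected without pandoc being installed.
    """
    cmd = [
        "pandoc",
        str(docx_path),
        "-f", "docx",        # input format
        "-t", "markdown",    # output format (pandoc's Markdown)
        "--wrap=none",       # assumption: keep each paragraph on one line
        "-o", str(md_path),  # output file
    ]
    if media_dir is not None:
        # Assumption: save images embedded in the DOCX to this directory.
        cmd.append(f"--extract-media={media_dir}")
    return cmd


def docx_to_markdown(docx_path, md_path=None, media_dir=None):
    """Run pandoc on docx_path and return the path of the Markdown file."""
    docx_path = Path(docx_path)
    md_path = Path(md_path) if md_path else docx_path.with_suffix(".md")
    # check=True raises CalledProcessError if pandoc exits with an error.
    subprocess.run(build_pandoc_command(docx_path, md_path, media_dir), check=True)
    return md_path
```

Calling `docx_to_markdown("book.docx", media_dir="media")` would write
`book.md` next to the input. The PDF to DOCX step in Google Docs is not
covered by this sketch.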

On Tue, 3 Jul 2018 at 14:08, CR <chuckr69-Wuw85uim5zDR7s880joybQ@public.gmane.org> wrote:

>
> I just wanted to comment on PDF and OCR, since I do a bit of both.
>
> A PDF is just a package, a container. It can contain images, text, or
> both. If a PDF is created from Word, then it is normally easy to extract
> the text if there is no security. Extracting tables from a PDF is a
> different matter though.
>
> When Google was first scanning in public domain books they just scanned an
> image of each page and placed that into a PDF. These images (as a PDF) did
> poorly with OCR software for many reasons: the page was crooked in the
> image, OCR doesn't do italics well, fonts were non-standard, etc. Some
> PDF images can be OCR'd, but they need a lot of cleanup. I do this on a
> regular basis.
>
> Google took the quick and easy route in saving books in digital format,
> and the resulting poor OCR in Google Books is where https://www.pgdp.net/
> (from Project Gutenberg) fills a niche. These people actually type in the
> text from an image of each page, but it's a long, multi-step process to
> finish a book.
>
> https://www.online-convert.com is free and has the option to extract text
> from a PDF, and to do OCR on images in the text if it's from Google Docs. It
> supports almost unlimited pages and is the best free option I've found so
> far for converting PDF to plain text (which I then convert to Markdown and
> then EPUB).

-- 
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To post to this group, send email to pandoc-discuss-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/CAFC_yuTL346rr%3DkCV_u%2BJJk6T%3DfG%2BZmSWntmTftBZO-29no1jw%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.
