From: BP Jonsson <bpjonsson-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org
Subject: Re: Is Pandoc OCR Capable
Date: Tue, 3 Jul 2018 23:27:58 +0200
Message-ID: <CAFC_yuTL346rr=kCV_u+JJk6T=fG+ZmSWntmTftBZO-29no1jw@mail.gmail.com>
In-Reply-To: <cf83dc69-7e1c-44e2-92d0-003de00c4a96-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
I convert PDF to DOCX with Google Docs and then convert the DOCX to
Markdown with Pandoc. That saves a lot of work compared to converting PDF
to non-Markdown text and then adding Markdown markup manually.
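
A minimal sketch of the second step (DOCX to Markdown), assuming the DOCX has
already been downloaded from Google Docs and that pandoc is on the PATH. The
file names, the function names, and the choice of --wrap=none and
--extract-media are illustrative assumptions, not part of the workflow
described above; only the pandoc options themselves are real.

```python
import shutil
import subprocess
from pathlib import Path


def pandoc_docx_to_markdown_cmd(docx_path, md_path=None, media_dir=None):
    """Build the pandoc argument list for a DOCX -> Markdown conversion.

    Hypothetical helper: the output name defaults to the input name with
    a .md suffix.
    """
    docx = Path(docx_path)
    out = Path(md_path) if md_path else docx.with_suffix(".md")
    cmd = [
        "pandoc",
        "--from=docx",
        "--to=markdown",
        "--wrap=none",          # assumption: keep each paragraph on one line
        "--output", str(out),
        str(docx),
    ]
    if media_dir:
        # Save embedded images to media_dir and link to them from the Markdown.
        cmd.insert(1, f"--extract-media={media_dir}")
    return cmd


def convert_docx_to_markdown(docx_path, md_path=None, media_dir=None):
    """Run pandoc; raises if pandoc is missing or exits with an error."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc not found on PATH")
    cmd = pandoc_docx_to_markdown_cmd(docx_path, md_path, media_dir)
    subprocess.run(cmd, check=True)
    return cmd
```

Calling convert_docx_to_markdown("book.docx", media_dir="media") would write
book.md next to the input. The first step (PDF to DOCX) is done by hand in
Google Docs and is not covered by this sketch.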
On Tue, 3 Jul 2018 at 14:08, CR <chuckr69-Wuw85uim5zDR7s880joybQ@public.gmane.org> wrote:
>
> I just wanted to comment on PDF and OCR, since I do a bit of both.
>
> A PDF is just a package, a container. It can contain images, text, or
> both. If a PDF is created from Word, then it is normally easy to extract
> the text if there is no security. Extracting tables from a PDF is a
> different matter though.
>
> When Google was first scanning in public domain books they just scanned an
> image of each page and placed that into a PDF. These images (as a PDF) did
> poorly with OCR software for many reasons: the page was crooked in the
> image, OCR doesn't do italics well, fonts were non-standard, etc. Still, some
> PDF images can be OCR'd, though they need a lot of cleanup. I do this on a
> regular basis.
>
> Google taking the quick and easy route in saving books in digital format,
> and the poor OCR results from Google books, is where https://www.pgdp.net/
> (from Project Gutenberg) fills a niche. These people actually type in the
> text from an image of each page, but it's a long, multi-step process to
> finish the book.
>
> https://www.online-convert.com is free and has the option to extract text
> from a PDF, and do OCR on images in the text if it's from Google Docs. It
> supports almost unlimited pages and is the best free option I've found so
> far for converting PDF to plain text (which I then convert to Markdown and
> then to EPUB).
--
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To post to this group, send email to pandoc-discuss-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/CAFC_yuTL346rr%3DkCV_u%2BJJk6T%3DfG%2BZmSWntmTftBZO-29no1jw%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.
Thread overview: 22+ messages
2018-06-25 10:47 Wei Wooi Peh
2018-06-25 12:26 Robert Zenz
2018-06-25 13:36 Wei Wooi Peh
2018-06-25 13:46 Robert Zenz
2018-06-26 12:46 Wei Wooi Peh
2018-06-26 12:51 Robert Zenz
2018-06-26 12:58 Paulo Ney de Souza
2018-06-26 13:14 Wei Wooi Peh
2018-06-26 13:17 Robert Zenz
2018-06-26 21:00 Jeremy Theler
2018-06-27 3:28 Ivan Lazar Miljenovic
2018-06-26 13:25 Eduardo Grosclaude
2018-06-26 13:31 Eduardo Grosclaude
2018-06-26 13:41 Wei Wooi Peh
2018-06-26 13:45 Paulo Ney de Souza
2018-06-26 14:03 Pablo Rodríguez
2018-06-28 12:36 Christophe Demko
2018-06-28 12:51 Jeremy Theler
2018-06-28 13:51 Shawn H Corey
2018-06-29 15:14 Christophe Demko
2018-07-03 12:08 CR
2018-07-03 21:27 BP Jonsson [this message]