On Tue, 3 Jul 2018 at 14:08, CR <chuckr69-Wuw85uim5zDR7s880joybQ@public.gmane.org> wrote:

I just wanted to comment on PDF and OCR, since I do a bit of both.

A PDF is just a package, a container. It can contain images, text, or both. If a PDF is created from Word, then it is normally easy to extract the text if there is no security. Extracting tables from a PDF is a different matter though.

When Google first started scanning public domain books, they just scanned an image of each page and placed it into a PDF. These image-only PDFs did poorly with OCR software for many reasons: the page was crooked in the image, OCR doesn't do italics well, fonts were non-standard, etc. Some image PDFs can be OCRed, but they need a lot of cleanup. I do this on a regular basis.

Google taking the quick and easy route in saving books in digital format, and the poor OCR results from Google Books, is where https://www.pgdp.net/ (from Project Gutenberg) fills a niche. These people actually type in the text from an image of each page, but it's a long, multi-step process to finish a book.

https://www.online-convert.com is free and has an option to extract the text from a PDF, and to do OCR on images in the PDF if it's from Google Docs. It supports almost unlimited pages and is the best free option I've found so far for converting PDF to plain text (which I then convert to Markdown and then to EPUB).
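
For anyone curious what the plain text to Markdown step can look like, here is a rough sketch in Python. It is not my exact process, just one way to do it, and it assumes the extracted text is hard-wrapped with blank lines between paragraphs. The function name text_to_markdown_paragraphs is made up for this example.

```python
import re


def text_to_markdown_paragraphs(raw: str) -> str:
    """Reflow hard-wrapped plain text into Markdown paragraphs.

    Hypothetical helper for illustration; real OCR output usually
    needs more cleanup than this.
    """
    # Normalize Windows and old Mac line endings to "\n".
    text = raw.replace("\r\n", "\n").replace("\r", "\n")

    # Rejoin words hyphenated across a line break: "exam-\nple" -> "example".
    # Caveat: this also merges real compounds split at the hyphen
    # ("well-\nknown" becomes "wellknown"), so proofread the result.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)

    paragraphs = []
    # One or more blank lines (possibly containing spaces) separate paragraphs.
    for block in re.split(r"\n\s*\n", text):
        # Collapse line breaks and runs of whitespace into single spaces.
        line = " ".join(block.split())
        if line:
            paragraphs.append(line)

    # Markdown paragraphs are separated by one blank line.
    return "\n\n".join(paragraphs) + "\n"
```

If the result is saved as book.md (a placeholder name), and assuming pandoc is used for the last step, `pandoc book.md -o book.epub` produces the EPUB.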
