* Is Pandoc OCR Capable
From: Wei Wooi Peh @ 2018-06-25 10:47 UTC
To: pandoc-discuss

Hi All,

Good day.

I would like to confirm: if I convert my Markdown to PDF, will the resulting PDF be OCR capable?

Please advise, thanks.

Best regards,
WeiWooi

--
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To post to this group, send email to pandoc-discuss-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/c375467c-7686-4111-9b92-778b7d8f3b59%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
* Re: Is Pandoc OCR Capable
From: Robert Zenz @ 2018-06-25 12:26 UTC
To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

I think you got your terms mixed up. Most of the time, OCR means "Optical Character Recognition", that is, extracting characters from an image.

But I guess you wanted to know whether the text in a PDF generated by Pandoc is treated as text by the PDF viewer. The answer is: it depends. I can only speak for wkhtmltopdf as the driver, and there the answer is yes.
* Re: Is Pandoc OCR Capable
From: Wei Wooi Peh @ 2018-06-25 13:36 UTC
To: pandoc-discuss

Hi Robert Zenz,

Appreciate your prompt feedback. Do you know whether TeX Live or MiKTeX has this capability as well?

Thanks

Best regards,
WeiWooi
* Re: Is Pandoc OCR Capable
From: Robert Zenz @ 2018-06-25 13:46 UTC
To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

I don't know, but I guess so. Anything else would make little sense to me (why would a text processor embed the text as an image?).
* Re: Is Pandoc OCR Capable
From: Wei Wooi Peh @ 2018-06-26 12:46 UTC
To: pandoc-discuss

I have tried with wkhtmltopdf. It seems the PDF is not OCR capable after I converted it from Markdown. Any idea?
* Re: Is Pandoc OCR Capable
From: Robert Zenz @ 2018-06-26 12:51 UTC
To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

I've tried with SumatraPDF, Google Chrome and Adobe Acrobat Reader and all three allow me to select text in the generated PDF.

What are you trying?
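For anyone who wants to reproduce the conversion being discussed here, it could be scripted roughly as follows. This is only a minimal sketch and not something posted in the thread: it assumes a Pandoc 2.x or later install (where the `--pdf-engine` option is available) and wkhtmltopdf on the PATH, and the file names `doc.md` and `doc.pdf` as well as the helper names are placeholders.

```python
import shutil
import subprocess


def build_pandoc_command(source, target, engine="wkhtmltopdf"):
    """Return the argv list for converting a Markdown file to PDF with pandoc."""
    # --pdf-engine selects the program pandoc uses to produce the PDF.
    return ["pandoc", source, "-o", target, f"--pdf-engine={engine}"]


def convert(source, target, engine="wkhtmltopdf"):
    """Run pandoc if it is on PATH; raise a clear error otherwise."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc was not found on PATH")
    subprocess.run(build_pandoc_command(source, target, engine), check=True)


if __name__ == "__main__":
    # Placeholder file names; only the command is printed, nothing is run.
    print(" ".join(build_pandoc_command("doc.md", "doc.pdf")))
```

Text that was typed in the Markdown source should then be selectable in the resulting PDF, as described above; text that only exists inside a .png or .jpg is a separate matter.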
* Re: Is Pandoc OCR Capable
From: Paulo Ney de Souza @ 2018-06-26 12:58 UTC
To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

Let me see if I can clarify the situation a bit. First, Wei Wooi, please be clear on what you mean by "OCR capable". There are many possible contexts here. For example, if you have an image containing a page of text, and that image is included in HTML and converted to PDF, Pandoc will NOT read (or OCR) your image to turn it into a PDF with text.

Regarding Robert Zenz's earlier message: TeX in general (any of the engines, including everything in MiKTeX and TeX Live) DOES preserve all text in the PDF file if you use the DEFAULT settings. It is of course possible to prevent TeX from preserving the text.

So please send us a brief example of what you are doing, so we can better help you.

Paulo Ney
* Re: Is Pandoc OCR Capable
From: Wei Wooi Peh @ 2018-06-26 13:14 UTC
To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

Hi Robert Zenz and Paulo Ney,

Appreciate your prompt response.

Let me try to give an example:

1. I created a document in Markdown and wrote some text introducing my flowchart.

2. To make it more interesting, I added figures, so I cut and pasted (or imported) certain images (.png or .jpg format) into my Markdown document. Because the flowchart below is in an image format, we are not able to select or scan any text from it:

![image](media/figure1.jpg)

3. What I need is that, when my entire Markdown document is converted to PDF, my images (.png or .jpg) become OCR capable as well.

Hopefully this explanation makes my question clearer.

Thanks

Best regards,
WeiWooi

[Attachment: image.png, image/png, 51485 bytes]
* Re: Is Pandoc OCR Capable
From: Robert Zenz @ 2018-06-26 13:17 UTC
To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

In that case the answer is no: Pandoc does not come with an OCR system or software of any kind. At least not as far as I know.
* Re: Is Pandoc OCR Capable
From: Jeremy Theler @ 2018-06-26 21:00 UTC
To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

Why would you insert this very nice figure into a document as a bitmap (.png or .jpg) and not as a vector graphic (.svg, .pdf or .eps)? From a purely technical point of view, only pictures should be bitmaps. And this figure is not a picture.
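If vector versions of the figures are available, switching the Markdown source over to them can be automated. The sketch below is one possible way to do it, under the assumption that each bitmap has an SVG sibling with the same base name (for example `media/figure1.svg` next to `media/figure1.jpg`); the function name is hypothetical, and image links that carry a title are deliberately left alone.

```python
import re

# Matches Markdown image links such as ![alt](path/name.png).
# Only the file extension is rewritten; links with a title after the
# path, like ![alt](x.png "title"), do not match and stay unchanged.
IMAGE_LINK = re.compile(
    r"(!\[[^\]]*\]\([^)\s]+)\.(?:png|jpe?g)(\))", re.IGNORECASE
)


def use_svg_figures(markdown_text):
    """Point bitmap image links at same-named .svg files (assumed to exist)."""
    return IMAGE_LINK.sub(r"\1.svg\2", markdown_text)


if __name__ == "__main__":
    print(use_svg_figures("![image](media/figure1.jpg)"))
```

Whether the text inside the SVG ends up selectable in the PDF still depends on the PDF engine and on how the SVG was exported (text converted to paths is no longer text).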
* Re: Is Pandoc OCR Capable
From: Ivan Lazar Miljenovic @ 2018-06-27 3:28 UTC
To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

On Wed, 27 Jun 2018 at 07:00, Jeremy Theler <jeremy-24em0bpozeFWk0Htik3J/w@public.gmane.org> wrote:

> Why would you insert this very nice figure into a document as a bitmap
> (.png or .jpg) and not as a vector graphic (.svg .pdf or .eps)?
> From a pure technical point of view, only pictures should be bitmaps. And
> this figure is not a picture.

And if you go via a TeX backend, there's a way that Inkscape can have LaTeX do the actual text layout; not sure if this is possible with Pandoc in the mix as well though.

--
Ivan Lazar Miljenovic
Ivan.Miljenovic-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org
http://IvanMiljenovic.wordpress.com
* Re: Is Pandoc OCR Capable
From: Eduardo Grosclaude @ 2018-06-26 13:25 UTC
To: pandoc-discuss

On Monday, June 25, 2018 at 7:47:56 AM UTC-3, Wei Wooi Peh wrote:
>
> Hi All,
>
> Good day.
>
> I would like to confirm: if I convert my Markdown to PDF, will the
> resulting PDF be OCR capable?
>
> Please advise, thanks.
>
> Best regards,
> WeiWooi
* Re: Is Pandoc OCR Capable
From: Eduardo Grosclaude @ 2018-06-26 13:31 UTC
To: pandoc-discuss

I think we still do not understand what you mean by "OCR capable". Perhaps you want to submit your final PDF to an OCR program to recover the text that is inside the graphics? Then the answer would be yes, as long as the fonts and colors are readable to the OCR program... but this has nothing to do with Pandoc as far as I can see. Any PDF with graphics, Pandoc-generated or not, should be "OCR capable", if this is what you mean.

If you mean that the text should be extractable from the graphics by some other means, maybe you should take a look at the SVG graphics format. With SVG, text is embedded in the graphics and can even be edited.
* Re: Is Pandoc OCR Capable
From: Wei Wooi Peh @ 2018-06-26 13:41 UTC
To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

Hi Eduardo Grosclaude,

Thank you for the response.

Maybe I can put it this way: after I have converted my Markdown (text + images) to PDF and opened the PDF document, I should be able to click on the images and select the text inside them, without sending the PDF to any OCR program again to recover the text inside the images.

For example:

In Markdown, I am only able to select the text of my statements, but not the text inside the images (because they are .jpg or .png).

After using Pandoc with the wkhtmltopdf engine to convert my Markdown to PDF, I am assuming my PDF document will allow me to select the text in both the statements and the images.

Thanks

Best regards,
WeiWooi
* Re: Is Pandoc OCR Capable [not found] ` <CAAcdRmnLizCoq2Y8_x-M7nqb7R=D_WpHXAzdi1nz5S+vtXwY4g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2018-06-26 13:45 ` Paulo Ney de Souza 2018-06-26 14:03 ` Pablo Rodríguez 1 sibling, 0 replies; 22+ messages in thread
From: Paulo Ney de Souza @ 2018-06-26 13:45 UTC (permalink / raw)
To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

Pandoc will not be of any help with that.

Paulo Ney
* Re: Is Pandoc OCR Capable [not found] ` <CAAcdRmnLizCoq2Y8_x-M7nqb7R=D_WpHXAzdi1nz5S+vtXwY4g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2018-06-26 13:45 ` Paulo Ney de Souza @ 2018-06-26 14:03 ` Pablo Rodríguez 1 sibling, 0 replies; 22+ messages in thread
From: Pablo Rodríguez @ 2018-06-26 14:03 UTC (permalink / raw)
To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

On 06/26/2018 03:41 PM, Wei Wooi Peh wrote:
> [...]
> Maybe I can put it this way: after I have converted my Markdown (text
> + images) to PDF and opened the PDF document, I should be able to click
> on the images and select the text inside them, without sending the PDF
> to any OCR program again to recover the text inside the images.

Hi WeiWooi,

I thought you were interested in tagged PDF documents, but I don’t think that fits your explanation above.

> In Markdown, I am only able to select the text of my statements (text),
> but not the text inside the images (because they are .jpg or .png files).

You mean having vector images with text inside, don’t you?

I don’t think pandoc (or LaTeX) has anything to do with the images here.

Bitmap images cannot contain text. They only contain text as bitmaps. OCR software doesn’t read text from them; it only recognizes patterns and renders them as text.

If you want text in your images, you will need to generate them as vector images in a format that also allows text inside (SVG, PostScript, EPS, or PDF).

I hope it helps,

Pablo
-- 
http://www.ousia.tk
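As a rough illustration of Pablo's point that vector formats such as SVG keep text as text, the following Python sketch lists the strings stored in the `<text>` elements of an SVG document. It uses only the standard library. The function name `svg_text_strings` and the sample SVG are made up for this example and are not taken from the thread. A JPEG or PNG has no such elements, so there is nothing comparable to read from it without OCR.

```python
import xml.etree.ElementTree as ET

# Namespace prefix that ElementTree puts in front of standard SVG tag names.
SVG_NS = "{http://www.w3.org/2000/svg}"


def svg_text_strings(svg_source):
    """Return the non-empty strings found in <text> elements of an SVG document."""
    root = ET.fromstring(svg_source)
    strings = []
    for element in root.iter(SVG_NS + "text"):
        # itertext() also collects text nested in child elements such as <tspan>.
        content = "".join(element.itertext()).strip()
        if content:
            strings.append(content)
    return strings


# Hypothetical sample figure, written by hand for this sketch.
example = """<svg xmlns="http://www.w3.org/2000/svg" width="200" height="50">
  <rect width="200" height="50" fill="white"/>
  <text x="10" y="30">text in a figure</text>
</svg>"""

print(svg_text_strings(example))
```

An empty result suggests that the lettering in a figure was converted to paths or embedded as a bitmap, in which case a PDF built from that figure will likely not have selectable text in it.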
* Re: Is Pandoc OCR Capable [not found] ` <c375467c-7686-4111-9b92-778b7d8f3b59-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org> ` (3 preceding siblings ...) 2018-06-26 13:31 ` Eduardo Grosclaude @ 2018-06-28 12:36 ` Christophe Demko [not found] ` <6ca16ea7-1a04-4c2e-8856-e75b5465d853-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org> 2018-07-03 12:08 ` CR 5 siblings, 1 reply; 22+ messages in thread
From: Christophe Demko @ 2018-06-28 12:36 UTC (permalink / raw)
To: pandoc-discuss

With Inkscape and a filter that transforms an SVG image into LaTeX code, you can include image text in your final PDF.

https://tex.stackexchange.com/questions/2099/how-to-include-svg-diagrams-in-latex
* Re: Is Pandoc OCR Capable [not found] ` <6ca16ea7-1a04-4c2e-8856-e75b5465d853-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org> @ 2018-06-28 12:51 ` Jeremy Theler [not found] ` <e0252af0901d3a6a413f35a4f5c008ad8a13c41d.camel-24em0bpozeFWk0Htik3J/w@public.gmane.org> 0 siblings, 1 reply; 22+ messages in thread
From: Jeremy Theler @ 2018-06-28 12:51 UTC (permalink / raw)
To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

No need to do that. Just create an SVG in Inkscape, convert it to PDF, and include it from Pandoc.

gtheler@tom:~/run/pandoc$ inkscape --export-pdf=text.pdf text.svg
gtheler@tom:~/run/pandoc$ pandoc document.md -o document.pdf

See the example PDF: you can select "text in a figure" as text.

The best way to solve a problem is to avoid it. I do not understand why people kill vector figures by converting them to bitmaps.

-- 
jeremy theler
www.seampex.com

[-- Attachment #2: document.md --]
[-- Type: text/markdown, Size: 95 bytes --]

# Header

This is real text paragraph.

![Caption](text.pdf)

This is another text pragraph.

[-- Attachment #3: text.svg --]
[-- Type: image/svg+xml, Size: 2556 bytes --]

[-- Attachment #4: document.pdf --]
[-- Type: application/pdf, Size: 55346 bytes --]
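The two commands in Jeremy's message can be wrapped in a small script. The Python sketch below is only one assumed way to automate them: the function names `build_commands` and `run_workflow` are hypothetical, and the file names are the ones used in the message. To my knowledge, Inkscape 1.0 and later replaced `--export-pdf` with `--export-filename`, so the sketch takes a flag to choose between the two spellings; check `inkscape --help` on your own system before relying on either.

```python
import shutil
import subprocess


def build_commands(svg_path, figure_pdf_path, markdown_path, output_pdf_path,
                   inkscape_1_or_later=False):
    """Return two argument lists: SVG -> PDF figure, then Markdown -> PDF."""
    if inkscape_1_or_later:
        # Assumption: Inkscape 1.x infers the output type from the file extension.
        export_flag = "--export-filename=" + figure_pdf_path
    else:
        # Spelling used in the message above (older Inkscape releases).
        export_flag = "--export-pdf=" + figure_pdf_path
    return [
        ["inkscape", export_flag, svg_path],
        ["pandoc", markdown_path, "-o", output_pdf_path],
    ]


def run_workflow(commands):
    """Run each command in order, stopping at the first failure."""
    for argv in commands:
        if shutil.which(argv[0]) is None:
            raise FileNotFoundError(argv[0] + " is not installed or not on PATH")
        subprocess.run(argv, check=True)


commands = build_commands("text.svg", "text.pdf", "document.md", "document.pdf")
for argv in commands:
    # Print the commands instead of running them, so the sketch has no side effects.
    print(" ".join(argv))
```

Calling `run_workflow(commands)` would execute both steps; it is left out of the sketch because it needs Inkscape, pandoc, and a PDF engine to be installed.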
* Re: Is Pandoc OCR Capable [not found] ` <e0252af0901d3a6a413f35a4f5c008ad8a13c41d.camel-24em0bpozeFWk0Htik3J/w@public.gmane.org> @ 2018-06-28 13:51 ` Shawn H Corey 2018-06-29 15:14 ` Christophe Demko 1 sibling, 0 replies; 22+ messages in thread
From: Shawn H Corey @ 2018-06-28 13:51 UTC (permalink / raw)
To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

On Thu, 28 Jun 2018 09:51:20 -0300 Jeremy Theler <jeremy-24em0bpozeFWk0Htik3J/w@public.gmane.org> wrote:
> The best way to solve a problem is to avoid it. I do not understand
> why people kill vector figures by converting them to bitmaps.

Because at one time, it was the only way to do it. There was no direct way to convert vectors into PS or PDF. Old habits die hard.

-- 
Don't stop where the ink does.

Shawn H Corey
* Re: Is Pandoc OCR Capable [not found] ` <e0252af0901d3a6a413f35a4f5c008ad8a13c41d.camel-24em0bpozeFWk0Htik3J/w@public.gmane.org> 2018-06-28 13:51 ` Shawn H Corey @ 2018-06-29 15:14 ` Christophe Demko 1 sibling, 0 replies; 22+ messages in thread
From: Christophe Demko @ 2018-06-29 15:14 UTC (permalink / raw)
To: pandoc-discuss

I did not know that pandoc could include a PDF file as an image.
* Re: Is Pandoc OCR Capable [not found] ` <c375467c-7686-4111-9b92-778b7d8f3b59-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org> ` (4 preceding siblings ...) 2018-06-28 12:36 ` Christophe Demko @ 2018-07-03 12:08 ` CR [not found] ` <cf83dc69-7e1c-44e2-92d0-003de00c4a96-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org> 5 siblings, 1 reply; 22+ messages in thread
From: CR @ 2018-07-03 12:08 UTC (permalink / raw)
To: pandoc-discuss

I just wanted to comment on PDF and OCR, since I do a bit of both.

A PDF is just a package, a container. It can contain images, text, or both. If a PDF is created from Word, then it is normally easy to extract the text if there is no security. Extracting tables from a PDF is a different matter, though.

When Google was first scanning public domain books, they just scanned an image of each page and placed that into a PDF. These images (as a PDF) did poorly with OCR software for many reasons: the page was crooked in the image, OCR doesn't do italics well, fonts were non-standard, etc. Some PDF images can be OCRed, but they need a lot of cleanup. I do this on a regular basis.

Google taking the quick and easy route in saving books in digital format, and the poor OCR results from Google Books, is where https://www.pgdp.net/ (from Project Gutenberg) fills a niche. These people actually type in the text from an image of each page, but it's a long, multiple-step process to finish the book.

https://www.online-convert.com is free and has the option to extract text in a PDF, and do OCR on images in the text if it's from Google Docs. It supports almost unlimited pages and is the best free option I've found so far for converting PDF to plain text (which I then convert to Markdown, then EPUB).
* Re: Is Pandoc OCR Capable [not found] ` <cf83dc69-7e1c-44e2-92d0-003de00c4a96-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org> @ 2018-07-03 21:27 ` BP Jonsson 0 siblings, 0 replies; 22+ messages in thread
From: BP Jonsson @ 2018-07-03 21:27 UTC (permalink / raw)
To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

I convert PDF to DOCX with Google Docs and then convert the DOCX to Markdown with Pandoc. That saves a lot of work compared to converting PDF to non-Markdown text and then adding Markdown markup manually.
end of thread, other threads:[~2018-07-03 21:27 UTC | newest]

Thread overview: 22+ messages
2018-06-25 10:47 Is Pandoc OCR Capable Wei Wooi Peh
2018-06-25 12:26 Robert Zenz
2018-06-25 13:36 Wei Wooi Peh
2018-06-25 13:46 Robert Zenz
2018-06-26 12:46 Wei Wooi Peh
2018-06-26 12:51 Robert Zenz
2018-06-26 12:58 Paulo Ney de Souza
2018-06-26 13:14 Wei Wooi Peh
2018-06-26 13:17 Robert Zenz
2018-06-26 21:00 Jeremy Theler
2018-06-27 3:28 Ivan Lazar Miljenovic
2018-06-26 13:25 Eduardo Grosclaude
2018-06-26 13:31 Eduardo Grosclaude
2018-06-26 13:41 Wei Wooi Peh
2018-06-26 13:45 Paulo Ney de Souza
2018-06-26 14:03 Pablo Rodríguez
2018-06-28 12:36 Christophe Demko
2018-06-28 12:51 Jeremy Theler
2018-06-28 13:51 Shawn H Corey
2018-06-29 15:14 Christophe Demko
2018-07-03 12:08 CR
2018-07-03 21:27 BP Jonsson