Is Pandoc OCR Capable

public inbox archive for pandoc-discuss@googlegroups.com

* Is Pandoc OCR Capable
@ 2018-06-25 10:47 Wei Wooi Peh
       [not found] ` <c375467c-7686-4111-9b92-778b7d8f3b59-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  0 siblings, 1 reply; 22+ messages in thread
From: Wei Wooi Peh @ 2018-06-25 10:47 UTC (permalink / raw)
  To: pandoc-discuss


[-- Attachment #1.1: Type: text/plain, Size: 747 bytes --]

Hi All,

Good day.

I would like to confirm: if I convert my Markdown to PDF format, will
the resulting PDF be OCR capable?

Please advise, thanks.

Best regards,
WeiWooi

-- 
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To post to this group, send email to pandoc-discuss-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/c375467c-7686-4111-9b92-778b7d8f3b59%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

[-- Attachment #1.2: Type: text/html, Size: 1264 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

[parent not found: <c375467c-7686-4111-9b92-778b7d8f3b59-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>]

* Re: Is Pandoc OCR Capable
       [not found] ` <c375467c-7686-4111-9b92-778b7d8f3b59-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2018-06-25 12:26   ` Robert Zenz
       [not found]     ` <5B30DF86.8020307-q1xk7osDwJUWQnjQ7V0W7w@public.gmane.org>
  2018-06-26 12:46   ` Wei Wooi Peh
                     ` (4 subsequent siblings)
  5 siblings, 1 reply; 22+ messages in thread
From: Robert Zenz @ 2018-06-25 12:26 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

I think you have your terms mixed up. OCR usually stands for "Optical
Character Recognition", which means extracting characters from an image.

But I guess you want to know whether the text in a PDF generated by Pandoc is
treated as text by the PDF viewer. The answer is: it depends. I can only speak
for wkhtmltopdf as the driver, and there the answer is yes.
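
One way to check this for a given setup is to render the document and then try
to pull text back out of the PDF. The following is only a minimal sketch: it
assumes `pandoc`, `wkhtmltopdf` and poppler's `pdftotext` are installed, and
the helper names and file names are made up for illustration.

```python
import shutil
import subprocess


def pandoc_pdf_command(source, target, engine="wkhtmltopdf"):
    """Build the pandoc command line that renders `source` to a PDF."""
    return ["pandoc", source, "-o", target, f"--pdf-engine={engine}"]


def extract_pdf_text(pdf_path):
    """Return the text pdftotext finds in the PDF, or None if the tool is missing."""
    if shutil.which("pdftotext") is None:
        return None
    # "-" as the output file makes pdftotext write to standard output.
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Running `subprocess.run(pandoc_pdf_command("input.md", "output.pdf"), check=True)`
and then `extract_pdf_text("output.pdf")` should return the body text if the
PDF carries real text. Text that only exists inside an embedded bitmap will
not come back this way.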


^ permalink raw reply	[flat|nested] 22+ messages in thread

[parent not found: <5B30DF86.8020307-q1xk7osDwJUWQnjQ7V0W7w@public.gmane.org>]

* Re: Is Pandoc OCR Capable
       [not found]     ` <5B30DF86.8020307-q1xk7osDwJUWQnjQ7V0W7w@public.gmane.org>
@ 2018-06-25 13:36       ` Wei Wooi Peh
       [not found]         ` <311aad2d-67b2-4b41-be6c-8b23b1054b81-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  0 siblings, 1 reply; 22+ messages in thread
From: Wei Wooi Peh @ 2018-06-25 13:36 UTC (permalink / raw)
  To: pandoc-discuss


[-- Attachment #1.1: Type: text/plain, Size: 1445 bytes --]

Hi Robert Zenz,

I appreciate your prompt feedback. Do you know whether TeX Live or MiKTeX
have this capability as well?

Thanks

Best regards,
WeiWooi


[-- Attachment #1.2: Type: text/html, Size: 2156 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

[parent not found: <311aad2d-67b2-4b41-be6c-8b23b1054b81-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>]

* Re: Is Pandoc OCR Capable
       [not found]         ` <311aad2d-67b2-4b41-be6c-8b23b1054b81-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2018-06-25 13:46           ` Robert Zenz
  0 siblings, 0 replies; 22+ messages in thread
From: Robert Zenz @ 2018-06-25 13:46 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

I don't know, but I guess so. Anything else would make little sense to me (why
would a text processor embed the text as an image?).




^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Is Pandoc OCR Capable
       [not found] ` <c375467c-7686-4111-9b92-778b7d8f3b59-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  2018-06-25 12:26   ` Robert Zenz
@ 2018-06-26 12:46   ` Wei Wooi Peh
       [not found]     ` <ae2eb78f-7f99-4cc1-9cbe-1faecdb8e457-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  2018-06-26 13:25   ` Eduardo Grosclaude
                     ` (3 subsequent siblings)
  5 siblings, 1 reply; 22+ messages in thread
From: Wei Wooi Peh @ 2018-06-26 12:46 UTC (permalink / raw)
  To: pandoc-discuss


[-- Attachment #1.1: Type: text/plain, Size: 956 bytes --]

I have tried with wkhtmltopdf. It seems the result is not OCR capable after I
converted it from Markdown to PDF. Any ideas?


[-- Attachment #1.2: Type: text/html, Size: 1605 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

[parent not found: <ae2eb78f-7f99-4cc1-9cbe-1faecdb8e457-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>]

* Re: Is Pandoc OCR Capable
       [not found]     ` <ae2eb78f-7f99-4cc1-9cbe-1faecdb8e457-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2018-06-26 12:51       ` Robert Zenz
  2018-06-26 12:58       ` Paulo Ney de Souza
  1 sibling, 0 replies; 22+ messages in thread
From: Robert Zenz @ 2018-06-26 12:51 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

I've tried with SumatraPDF, Google Chrome and Adobe Acrobat Reader and all three
allow me to select text in the generated PDF.

What are you trying?




^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Is Pandoc OCR Capable
       [not found]     ` <ae2eb78f-7f99-4cc1-9cbe-1faecdb8e457-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  2018-06-26 12:51       ` Robert Zenz
@ 2018-06-26 12:58       ` Paulo Ney de Souza
       [not found]         ` <CAFVhNZNEq6SjiPihWiRrdstJ5BM4RUUBNZs2sjkTyWTRMsHeXQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 1 reply; 22+ messages in thread
From: Paulo Ney de Souza @ 2018-06-26 12:58 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

[-- Attachment #1: Type: text/plain, Size: 2531 bytes --]

Let me see if I can clarify the situation a bit. First, Wei Wooi, please be
clear on what you mean by "OCR capable". There are many possible contexts
here. For example, if you have an image of a page of text, and that image is
included in HTML and converted to PDF, Pandoc will NOT read (or OCR) your
image to turn it into a PDF with text.

To second Robert Zenz's previous message: TeX in general (any of the
engines, including everything in MiKTeX and TeX Live) DOES preserve all text
in the PDF file if you use the DEFAULT settings. It is obviously possible to
prevent TeX from preserving the text.

So please send us a brief example of what you are doing, so we can better
help you.

Paulo Ney



[-- Attachment #2: Type: text/html, Size: 3790 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

[parent not found: <CAFVhNZNEq6SjiPihWiRrdstJ5BM4RUUBNZs2sjkTyWTRMsHeXQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]

* Re: Is Pandoc OCR Capable
       [not found]         ` <CAFVhNZNEq6SjiPihWiRrdstJ5BM4RUUBNZs2sjkTyWTRMsHeXQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2018-06-26 13:14           ` Wei Wooi Peh
       [not found]             ` <CAAcdRmn8na1nCexX16pWoNv22m8r+KAnERoeMA7gDqWVRSrXVw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 22+ messages in thread
From: Wei Wooi Peh @ 2018-06-26 13:14 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw


[-- Attachment #1.1: Type: text/plain, Size: 4244 bytes --]

Hi Robert Zenz and Paulo Ney,

I appreciate your prompt responses.

Let me try to give an example:

1. I created a document in Markdown and wrote some text in it introducing
my flowchart.

2. To make it more interesting, I added a figure: I cut and pasted (or
imported) certain images (.png or .jpg format) into my Markdown document.
Because the flowchart is in an image format, we are not able to
select/scan any text from it:

![image](media/figure1.jpg)

3. What I need is that, when my entire Markdown document is converted to PDF
format, the images (.png or .jpg) become OCR capable as well.

Hopefully this explanation makes my question clearer.

Thanks

Best regards,
WeiWooi



[-- Attachment #1.2: Type: text/html, Size: 6463 bytes --]

[-- Attachment #2: image.png --]
[-- Type: image/png, Size: 51485 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

[parent not found: <CAAcdRmn8na1nCexX16pWoNv22m8r+KAnERoeMA7gDqWVRSrXVw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]

* Re: Is Pandoc OCR Capable
       [not found]             ` <CAAcdRmn8na1nCexX16pWoNv22m8r+KAnERoeMA7gDqWVRSrXVw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2018-06-26 13:17               ` Robert Zenz
  2018-06-26 21:00               ` Jeremy Theler
  1 sibling, 0 replies; 22+ messages in thread
From: Robert Zenz @ 2018-06-26 13:17 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

In that case the answer is no: Pandoc does not come with an OCR system/software
of any kind, at least not as far as I know.
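
So any OCR would have to be a separate step outside Pandoc. As a rough sketch
of what that could look like, the snippet below builds a Tesseract call that
reads an image and writes a searchable PDF of it. It assumes the `tesseract`
command-line tool is installed; the helper names are hypothetical, and how
well OCR copes with a flowchart is untested here.

```python
import shutil
import subprocess


def tesseract_pdf_command(image_path, output_base):
    """Build the Tesseract call that writes `<output_base>.pdf` with a text layer."""
    # The trailing "pdf" selects Tesseract's PDF output configuration.
    return ["tesseract", image_path, output_base, "pdf"]


def ocr_image_to_pdf(image_path, output_base):
    """Run Tesseract if it is installed; return True when the tool was run."""
    if shutil.which("tesseract") is None:
        return False
    subprocess.run(tesseract_pdf_command(image_path, output_base), check=True)
    return True
```

For the figure from the example, `ocr_image_to_pdf("media/figure1.jpg", "figure1")`
would be expected to produce `figure1.pdf` next to the original image.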




^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Is Pandoc OCR Capable
       [not found]             ` <CAAcdRmn8na1nCexX16pWoNv22m8r+KAnERoeMA7gDqWVRSrXVw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2018-06-26 13:17               ` Robert Zenz
@ 2018-06-26 21:00               ` Jeremy Theler
       [not found]                 ` <07a5f522ae7e34890861a20b74f0fa034f640b63.camel-24em0bpozeFWk0Htik3J/w@public.gmane.org>
  1 sibling, 1 reply; 22+ messages in thread
From: Jeremy Theler @ 2018-06-26 21:00 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw


[-- Attachment #1.1: Type: text/plain, Size: 1209 bytes --]

Why would you insert this very nice figure into a document as a bitmap (.png or .jpg) and not as a vector graphic (.svg, .pdf or .eps)? From a purely technical point of view, only pictures should be bitmaps, and this figure is not a picture.
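
To see which figures in a Markdown file are affected, one could list the image
references that point at bitmap files. This is a small sketch that only
handles the inline `![alt](path)` syntax; the function name and the sample
text are made up for illustration.

```python
import re

# Matches inline Markdown images ![alt](path "optional title") and captures the path.
IMAGE_PATTERN = re.compile(r"!\[[^\]]*\]\(([^)\s]+)[^)]*\)")
BITMAP_EXTENSIONS = (".png", ".jpg", ".jpeg")


def bitmap_images(markdown_text):
    """Return the image paths in the Markdown text that point at bitmap files."""
    paths = IMAGE_PATTERN.findall(markdown_text)
    return [p for p in paths if p.lower().endswith(BITMAP_EXTENSIONS)]


sample = "Intro text\n\n![image](media/figure1.jpg)\n\n![chart](media/chart.svg)\n"
print(bitmap_images(sample))  # ['media/figure1.jpg']
```

Each path it reports is a candidate for re-exporting as a vector graphic.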
On Tue, 2018-06-26 at 21:14 +0800, Wei Wooi Peh wrote:
> 2. To make it more interesting, I have added in the figure table, hence I cut and paste (or import) certain images (.png or .jpg format) into my markdown document. If this is in image format, we are not able to select/scan any text from the flowchart below
> 
> ![image](media/figure1.jpg)

[-- Attachment #1.2: Type: text/html, Size: 2414 bytes --]

[-- Attachment #2: image.png --]
[-- Type: image/png, Size: 51485 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

[parent not found: <07a5f522ae7e34890861a20b74f0fa034f640b63.camel-24em0bpozeFWk0Htik3J/w@public.gmane.org>]

* Re: Is Pandoc OCR Capable
       [not found]                 ` <07a5f522ae7e34890861a20b74f0fa034f640b63.camel-24em0bpozeFWk0Htik3J/w@public.gmane.org>
@ 2018-06-27  3:28                   ` Ivan Lazar Miljenovic
  0 siblings, 0 replies; 22+ messages in thread
From: Ivan Lazar Miljenovic @ 2018-06-27  3:28 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw


[-- Attachment #1.1: Type: text/plain, Size: 2350 bytes --]

On Wed, 27 Jun 2018 at 07:00, Jeremy Theler <jeremy-24em0bpozeFWk0Htik3J/w@public.gmane.org> wrote:

> Why would you insert this very nice figure into a document as a bitmap
> (.png or .jpg) and not as a vector graphic (.svg .pdf or .eps)?
> From a pure technical point of view, only pictures should be bitmaps. And
> this figure is not a picture.
>

And if you go via a TeX backend, there's a way that Inkscape can have LaTeX
do the actual text layout; not sure if this is possible with Pandoc in the
mix as well though.




-- 
Ivan Lazar Miljenovic
Ivan.Miljenovic-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org
http://IvanMiljenovic.wordpress.com


[-- Attachment #1.2: Type: text/html, Size: 4492 bytes --]

[-- Attachment #2: image.png --]
[-- Type: image/png, Size: 51485 bytes --]

[-- Attachment #3: image.png --]
[-- Type: image/png, Size: 51485 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Is Pandoc OCR Capable
       [not found] ` <c375467c-7686-4111-9b92-778b7d8f3b59-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  2018-06-25 12:26   ` Robert Zenz
  2018-06-26 12:46   ` Wei Wooi Peh
@ 2018-06-26 13:25   ` Eduardo Grosclaude
  2018-06-26 13:31   ` Eduardo Grosclaude
                     ` (2 subsequent siblings)
  5 siblings, 0 replies; 22+ messages in thread
From: Eduardo Grosclaude @ 2018-06-26 13:25 UTC (permalink / raw)
  To: pandoc-discuss


[-- Attachment #1.1: Type: text/plain, Size: 837 bytes --]




[-- Attachment #1.2: Type: text/html, Size: 1487 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Is Pandoc OCR Capable
       [not found] ` <c375467c-7686-4111-9b92-778b7d8f3b59-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
                     ` (2 preceding siblings ...)
  2018-06-26 13:25   ` Eduardo Grosclaude
@ 2018-06-26 13:31   ` Eduardo Grosclaude
       [not found]     ` <9c33293a-0bf0-48e2-8a51-b18cf6dfd4b8-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  2018-06-28 12:36   ` Christophe Demko
  2018-07-03 12:08   ` CR
  5 siblings, 1 reply; 22+ messages in thread
From: Eduardo Grosclaude @ 2018-06-26 13:31 UTC (permalink / raw)
  To: pandoc-discuss


[-- Attachment #1.1: Type: text/plain, Size: 1471 bytes --]

I think we still do not understand what you mean by "OCR capable".
Perhaps you want to submit your final PDF to an OCR program to recover the
text that is inside the graphics? Then the answer would be yes, as long as
the fonts and colors are readable by the OCR program, but this has
nothing to do with Pandoc as far as I can see. Any PDF with graphics,
Pandoc-generated or not, should be "OCR capable", if this is what you mean.

If you mean that the text should be extractable from the graphics by some
other means, maybe you should take a look at the SVG graphics format. With
SVG, text is embedded in the graphics as text and can even be edited.
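
To illustrate the SVG point: the labels in an SVG file are stored as character
data, so they can be read back without any OCR. The sketch below uses a tiny
hand-written SVG as a stand-in for a flowchart box; the sample markup and the
function name are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"

# A tiny hand-written SVG standing in for one box of a flowchart.
svg_source = """<svg xmlns="http://www.w3.org/2000/svg" width="200" height="80">
  <rect x="10" y="10" width="180" height="60" fill="none" stroke="black"/>
  <text x="100" y="45" text-anchor="middle">Start process</text>
</svg>"""


def svg_text(svg_string):
    """Collect the character data of every <text> element in the SVG."""
    root = ET.fromstring(svg_string)
    return ["".join(node.itertext()).strip() for node in root.iter(SVG_NS + "text")]


print(svg_text(svg_source))  # ['Start process']
```

Whether that text stays selectable in the final PDF depends on how the SVG is
converted along the way, which this sketch does not cover.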


[-- Attachment #1.2: Type: text/html, Size: 2152 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

[parent not found: <9c33293a-0bf0-48e2-8a51-b18cf6dfd4b8-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>]

* Re: Is Pandoc OCR Capable
       [not found]     ` <9c33293a-0bf0-48e2-8a51-b18cf6dfd4b8-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2018-06-26 13:41       ` Wei Wooi Peh
       [not found]         ` <CAAcdRmnLizCoq2Y8_x-M7nqb7R=D_WpHXAzdi1nz5S+vtXwY4g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 22+ messages in thread
From: Wei Wooi Peh @ 2018-06-26 13:41 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

[-- Attachment #1: Type: text/plain, Size: 3080 bytes --]

Hi Eduardo Grosclaude,

Thank you for the response.

Let me try to put it this way: after I have converted my Markdown (text +
images) to PDF, when I open the PDF document and click on the
images, I should be able to select the text inside the images without
sending them to an OCR program again to recover that text.

For example:

In Markdown, I am only able to select the text of my statements, but
not the text inside the images (because they are in .jpg or .png format).

After using Pandoc with the wkhtmltopdf engine to convert my Markdown to
PDF, I am assuming my PDF document will allow me to select the text in
both the statements and the images.

Thanks

Best regards,
WeiWooi


On Tue, Jun 26, 2018 at 9:31 PM, Eduardo Grosclaude <
eduardo.grosclaude-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:

> I think we still do not understand what you mean by being "OCR capable".
> Perhaps you want to submit your final PDF to an OCR program to recover the
> text that is inside the graphics? Then the answer would be yes, inasmuch as
> the fonts and colors are readable to the OCR program... but this has
> nothing to do with Pandoc as far as I can see. Any PDF with graphics,
> pandoc-generated or not, should be "OCR-capable", if this is what you mean.
> If you mean that the text should be extracted from the graphics by some
> other means, maybe you should take a look at the SVG graphics format. With
> SVG, text is embedded in the graphics and can even be edited.


[-- Attachment #2: Type: text/html, Size: 4628 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

[parent not found: <CAAcdRmnLizCoq2Y8_x-M7nqb7R=D_WpHXAzdi1nz5S+vtXwY4g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]

* Re: Is Pandoc OCR Capable
       [not found]         ` <CAAcdRmnLizCoq2Y8_x-M7nqb7R=D_WpHXAzdi1nz5S+vtXwY4g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2018-06-26 13:45           ` Paulo Ney de Souza
  2018-06-26 14:03           ` Pablo Rodríguez
  1 sibling, 0 replies; 22+ messages in thread
From: Paulo Ney de Souza @ 2018-06-26 13:45 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

[-- Attachment #1: Type: text/plain, Size: 4088 bytes --]

Pandoc will not be of any help with that.

Paulo Ney

On Tue, Jun 26, 2018 at 6:42 AM Wei Wooi Peh <pehweiwooi-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:

> Hi Eduardo Grosclaude,
>
> Thank you for the response.
>
> Maybe I can put it this way: after I've converted my Markdown (text +
> images) to PDF, when I open the PDF document and click on the images,
> I should be able to select the text inside the images without sending
> the PDF to an OCR program again to recover the text inside the images.
>
> For example:
>
> In Markdown, I am only able to select the text from my statements (text)
> but not the text inside the images (because they are .jpg or .png files).
>
> After using pandoc with the wkhtmltopdf engine to convert my Markdown to
> PDF, I am assuming my PDF document now allows me to select the text in
> both the statements and the images.
>
> Thanks
>
> Best regards,
> WeiWooi


[-- Attachment #2: Type: text/html, Size: 6142 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Is Pandoc OCR Capable
       [not found]         ` <CAAcdRmnLizCoq2Y8_x-M7nqb7R=D_WpHXAzdi1nz5S+vtXwY4g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2018-06-26 13:45           ` Paulo Ney de Souza
@ 2018-06-26 14:03           ` Pablo Rodríguez
  1 sibling, 0 replies; 22+ messages in thread
From: Pablo Rodríguez @ 2018-06-26 14:03 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

On 06/26/2018 03:41 PM, Wei Wooi Peh wrote:
> [...] 
> Maybe I can put it this way: after I've converted my Markdown (text
> + images) to PDF, when I open the PDF document and click on the
> images, I should be able to select the text inside the images
> without sending the PDF to an OCR program again to recover the text
> inside the images.

Hi WeiWooi,

I thought you were interested in tagged PDF documents, but I don’t think
that fits your explanation above.

> In Markdown, I am only able to select the text from my statements (text)
> but not the text inside the images (because they are .jpg or .png files).

You mean to have vector images with text inside, don’t you?

I don’t think pandoc (or LaTeX) has anything to do with the images here.
Bitmap images cannot contain actual text; they only contain pictures of
text, stored as pixels. OCR software doesn’t read text from them; it only
recognizes patterns and renders them as text.

If you want text in your images, you will need to generate them as
vector images in a format that also allows text inside (either SVG,
PostScript, EPS or PDF).
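
As a rough illustration of the difference (a minimal sketch, not something
posted in this thread): an SVG file stores a label as character data in a
`<text>` element, so software can read it back without any OCR, whereas a
.jpg or .png only stores pixels. The sample figure, the `svg_source`
variable and the `svg_text_strings` helper below are all made up for the
example; only Python's standard library is used.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

# Hypothetical figure: a white box with one text label inside it.
svg_source = f"""<svg xmlns="{SVG_NS}" width="200" height="60">
  <rect width="200" height="60" fill="white"/>
  <text x="10" y="35">text in a figure</text>
</svg>"""


def svg_text_strings(svg: str) -> list[str]:
    """Return the character data of every <text> element in an SVG string."""
    root = ET.fromstring(svg)
    # SVG elements live in the SVG namespace, so the tag must be qualified.
    return ["".join(el.itertext()).strip()
            for el in root.iter(f"{{{SVG_NS}}}text")]


print(svg_text_strings(svg_source))  # prints ['text in a figure']
```

No comparable lookup exists for a bitmap: the only way to get the words
back out of the pixels is pattern recognition, which is what OCR does.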

I hope it helps,

Pablo
-- 
http://www.ousia.tk



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Is Pandoc OCR Capable
       [not found] ` <c375467c-7686-4111-9b92-778b7d8f3b59-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
                     ` (3 preceding siblings ...)
  2018-06-26 13:31   ` Eduardo Grosclaude
@ 2018-06-28 12:36   ` Christophe Demko
       [not found]     ` <6ca16ea7-1a04-4c2e-8856-e75b5465d853-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  2018-07-03 12:08   ` CR
  5 siblings, 1 reply; 22+ messages in thread
From: Christophe Demko @ 2018-06-28 12:36 UTC (permalink / raw)
  To: pandoc-discuss


[-- Attachment #1.1: Type: text/plain, Size: 1057 bytes --]

With Inkscape and a filter that transforms SVG images into LaTeX code, you
can include image text in your final PDF:
https://tex.stackexchange.com/questions/2099/how-to-include-svg-diagrams-in-latex

On Monday, June 25, 2018 at 12:47:56 UTC+2, Wei Wooi Peh wrote:
>
> Hi All,
>
> Good day.
>
> Would like to confirm, If I wish to convert my markdow to pdf format, will 
> the pdf created is OCR capable?
>
> Please advise, thanks.
>
> Best regards,
> WeiWooi
>


[-- Attachment #1.2: Type: text/html, Size: 1682 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

[parent not found: <6ca16ea7-1a04-4c2e-8856-e75b5465d853-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>]

* Re: Is Pandoc OCR Capable
       [not found]     ` <6ca16ea7-1a04-4c2e-8856-e75b5465d853-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2018-06-28 12:51       ` Jeremy Theler
       [not found]         ` <e0252af0901d3a6a413f35a4f5c008ad8a13c41d.camel-24em0bpozeFWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 22+ messages in thread
From: Jeremy Theler @ 2018-06-28 12:51 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

[-- Attachment #1: Type: text/plain, Size: 2207 bytes --]

No need to do that.
Just create an svg in inkscape, convert it to pdf and include it from Pandoc.

gtheler@tom:~/run/pandoc$ inkscape --export-pdf=text.pdf text.svg
gtheler@tom:~/run/pandoc$ pandoc document.md -o document.pdf

See the example pdf, you can select "text in a figure" as text.
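
For scripting the same two steps, here is a minimal sketch in Python. It
only builds the two command lines and prints them; `build_commands` and its
`inkscape_1x` switch are hypothetical names for this example. One assumption
worth flagging: the `--export-pdf` option shown above belongs to the
Inkscape 0.9x command line, while Inkscape 1.0 and later use
`--export-filename` instead.

```python
import shlex


def build_commands(svg="text.svg", figure_pdf="text.pdf",
                   source="document.md", output="document.pdf",
                   inkscape_1x=True):
    """Return the two command lines as argument lists, without running them."""
    if inkscape_1x:
        # Inkscape 1.0+ spelling of the export option.
        export = ["inkscape", f"--export-filename={figure_pdf}", svg]
    else:
        # Inkscape 0.9x spelling, as used in the message above.
        export = ["inkscape", f"--export-pdf={figure_pdf}", svg]
    convert = ["pandoc", source, "-o", output]
    return [export, convert]


for command in build_commands():
    print(shlex.join(command))
```

To actually run the steps, each list could be passed to
`subprocess.run(command, check=True)`, provided both tools are installed.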

The best way to solve a problem is to avoid it. I do not understand why people kill vector figures by converting them to bitmaps.


--
jeremy theler
www.seampex.com


On Thu, 2018-06-28 at 05:36 -0700, Christophe Demko wrote:
> With inkscape and a filter able to transform svg image to LaTeX code, you can include image text in your final pdf. https://tex.stackexchange.com/questions/2099/how-to-include-svg-diagrams-in-latex


[-- Attachment #2: document.md --]
[-- Type: text/markdown, Size: 95 bytes --]

# Header

This is real text paragraph.

![Caption](text.pdf)

This is another text paragraph.



[-- Attachment #3: text.svg --]
[-- Type: image/svg+xml, Size: 2556 bytes --]

[-- Attachment #4: document.pdf --]
[-- Type: application/pdf, Size: 55346 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

[parent not found: <e0252af0901d3a6a413f35a4f5c008ad8a13c41d.camel-24em0bpozeFWk0Htik3J/w@public.gmane.org>]

* Re: Is Pandoc OCR Capable
       [not found]         ` <e0252af0901d3a6a413f35a4f5c008ad8a13c41d.camel-24em0bpozeFWk0Htik3J/w@public.gmane.org>
@ 2018-06-28 13:51           ` Shawn H Corey
  2018-06-29 15:14           ` Christophe Demko
  1 sibling, 0 replies; 22+ messages in thread
From: Shawn H Corey @ 2018-06-28 13:51 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

On Thu, 28 Jun 2018 09:51:20 -0300
Jeremy Theler <jeremy-24em0bpozeFWk0Htik3J/w@public.gmane.org> wrote:

> The best way to solve a problem is to avoid it. I do not understand
> why people kill vector figures by converting them to bitmaps.

Because at one time, it was the only way to do it. There was no direct
way to convert vectors into PS or PDF. Old habits die hard.


-- 
Don't stop where the ink does.

	Shawn H Corey


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Is Pandoc OCR Capable
       [not found]         ` <e0252af0901d3a6a413f35a4f5c008ad8a13c41d.camel-24em0bpozeFWk0Htik3J/w@public.gmane.org>
  2018-06-28 13:51           ` Shawn H Corey
@ 2018-06-29 15:14           ` Christophe Demko
  1 sibling, 0 replies; 22+ messages in thread
From: Christophe Demko @ 2018-06-29 15:14 UTC (permalink / raw)
  To: pandoc-discuss


[-- Attachment #1.1: Type: text/plain, Size: 2457 bytes --]

I did not know that pandoc could include a PDF file as an image.

On Thursday, June 28, 2018 at 14:51:26 UTC+2, Jeremy Theler wrote:
>
> No need to do that. 
> Just create an svg in inkscape, convert it to pdf and include it from 
> Pandoc. 
>
> gtheler@tom:~/run/pandoc$ inkscape --export-pdf=text.pdf text.svg 
> gtheler@tom:~/run/pandoc$ pandoc document.md -o document.pdf 
>
> See the example pdf, you can select "text in a figure" as text. 
>
> The best way to solve a problem is to avoid it. I do not understand why 
> people kill vector figures by converting them to bitmaps. 
>
>
> -- 
> jeremy theler 
> www.seampex.com 
>


[-- Attachment #1.2: Type: text/html, Size: 5211 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Is Pandoc OCR Capable
       [not found] ` <c375467c-7686-4111-9b92-778b7d8f3b59-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
                     ` (4 preceding siblings ...)
  2018-06-28 12:36   ` Christophe Demko
@ 2018-07-03 12:08   ` CR
       [not found]     ` <cf83dc69-7e1c-44e2-92d0-003de00c4a96-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  5 siblings, 1 reply; 22+ messages in thread
From: CR @ 2018-07-03 12:08 UTC (permalink / raw)
  To: pandoc-discuss

[-- Attachment #1.1: Type: text/plain, Size: 1897 bytes --]

I just wanted to comment on PDF and OCR, since I do a bit of both. 

A PDF is just a package, a container. It can contain images, text, or both. 
If a PDF is created from Word, then it is normally easy to extract the text 
if there is no security. Extracting tables from a PDF is a different matter 
though. 
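
One rough way to tell the two kinds of PDF apart, sketched in Python under
some assumptions: the `pdftotext` tool from Poppler is installed, and a file
that yields almost no extracted text is treated as image-only. The function
names and the 20-character threshold are arbitrary choices for this example.

```python
import subprocess


def extract_text(pdf_path):
    """Run `pdftotext PDF -` and return what it writes to standard output."""
    result = subprocess.run(["pdftotext", pdf_path, "-"],
                            capture_output=True, text=True, check=True)
    return result.stdout


def needs_ocr(extracted_text, min_chars=20):
    """Guess that a PDF is image-only if hardly any text could be extracted."""
    # strip() also removes the form feeds pdftotext puts between pages.
    return len(extracted_text.strip()) < min_chars
```

A call such as `needs_ocr(extract_text("scan.pdf"))` (the file name is a
placeholder) would then suggest whether the file has to go through OCR.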

When Google first scanned public domain books, they just scanned an
image of each page and placed it into a PDF. These image-only PDFs did
poorly with OCR software for many reasons: the page was crooked in the
image, OCR doesn't handle italics well, fonts were non-standard, etc. Some
image PDFs can be OCR'd, but they need a lot of cleanup. I do this on a
regular basis.

Google taking the quick and easy route in saving books in digital format,
and the poor OCR results from Google Books, is where https://www.pgdp.net/
(from Project Gutenberg) fills a niche. These people actually type in the
text from an image of each page, but it's a long, multi-step process to
finish a book.

https://www.online-convert.com is free and has the option to extract text 
in a PDF, and do OCR on images in the text if it's from Google Docs. It 
supports almost unlimited pages and is the best free option I've found so 
far for converting PDF to plain text (which I then convert to Markdown then 
EPUB).


[-- Attachment #1.2: Type: text/html, Size: 2450 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

[parent not found: <cf83dc69-7e1c-44e2-92d0-003de00c4a96-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>]

* Re: Is Pandoc OCR Capable
       [not found]     ` <cf83dc69-7e1c-44e2-92d0-003de00c4a96-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2018-07-03 21:27       ` BP Jonsson
  0 siblings, 0 replies; 22+ messages in thread
From: BP Jonsson @ 2018-07-03 21:27 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

[-- Attachment #1: Type: text/plain, Size: 2973 bytes --]

I convert PDF to DOCX with Google Docs and then convert the DOCX to
Markdown with Pandoc. That saves a lot of work compared to converting PDF
to non-Markdown text and then adding Markdown markup manually.
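
The second half of that workflow, sketched in Python as an example resting
on a few assumptions: the file names are placeholders and
`docx_to_markdown_command` is a made-up helper, while `-f docx`,
`-t markdown` and `-o` are standard pandoc options. The PDF-to-DOCX step in
Google Docs is done by hand and is not covered here.

```python
def docx_to_markdown_command(docx_path, markdown_path):
    """Build the pandoc command line for a DOCX-to-Markdown conversion."""
    return ["pandoc", "-f", "docx", "-t", "markdown",
            docx_path, "-o", markdown_path]


# Placeholder file names; pass the list to subprocess.run() to execute it.
print(" ".join(docx_to_markdown_command("book.docx", "book.md")))
```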

On Tue, 3 Jul 2018 at 14:08, CR <chuckr69-Wuw85uim5zDR7s880joybQ@public.gmane.org> wrote:

>
> I just wanted to comment on PDF and OCR, since I do a bit of both.
>
> A PDF is just a package, a container. It can contain images, text, or
> both. If a PDF is created from Word, then it is normally easy to extract
> the text if there is no security. Extracting tables from a PDF is a
> different matter though.
>
> When Google first scanned public domain books, they just scanned an
> image of each page and placed it into a PDF. These image-only PDFs did
> poorly with OCR software for many reasons: the page was crooked in the
> image, OCR doesn't handle italics well, fonts were non-standard, etc. Some
> image PDFs can be OCR'd, but they need a lot of cleanup. I do this on a
> regular basis.
>
> Google taking the quick and easy route in saving books in digital format,
> and the poor OCR results from Google Books, is where https://www.pgdp.net/
> (from Project Gutenberg) fills a niche. These people actually type in the
> text from an image of each page, but it's a long, multi-step process to
> finish a book.
>
> https://www.online-convert.com is free and has the option to extract text
> in a PDF, and do OCR on images in the text if it's from Google Docs. It
> supports almost unlimited pages and is the best free option I've found so
> far for converting PDF to plain text (which I then convert to Markdown then
> EPUB).
>


[-- Attachment #2: Type: text/html, Size: 4263 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2018-07-03 21:27 UTC | newest]

Thread overview: 22+ messages
2018-06-25 10:47 Is Pandoc OCR Capable Wei Wooi Peh
     [not found] ` <c375467c-7686-4111-9b92-778b7d8f3b59-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
2018-06-25 12:26   ` Robert Zenz
     [not found]     ` <5B30DF86.8020307-q1xk7osDwJUWQnjQ7V0W7w@public.gmane.org>
2018-06-25 13:36       ` Wei Wooi Peh
     [not found]         ` <311aad2d-67b2-4b41-be6c-8b23b1054b81-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
2018-06-25 13:46           ` Robert Zenz
2018-06-26 12:46   ` Wei Wooi Peh
     [not found]     ` <ae2eb78f-7f99-4cc1-9cbe-1faecdb8e457-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
2018-06-26 12:51       ` Robert Zenz
2018-06-26 12:58       ` Paulo Ney de Souza
     [not found]         ` <CAFVhNZNEq6SjiPihWiRrdstJ5BM4RUUBNZs2sjkTyWTRMsHeXQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2018-06-26 13:14           ` Wei Wooi Peh
     [not found]             ` <CAAcdRmn8na1nCexX16pWoNv22m8r+KAnERoeMA7gDqWVRSrXVw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2018-06-26 13:17               ` Robert Zenz
2018-06-26 21:00               ` Jeremy Theler
     [not found]                 ` <07a5f522ae7e34890861a20b74f0fa034f640b63.camel-24em0bpozeFWk0Htik3J/w@public.gmane.org>
2018-06-27  3:28                   ` Ivan Lazar Miljenovic
2018-06-26 13:25   ` Eduardo Grosclaude
2018-06-26 13:31   ` Eduardo Grosclaude
     [not found]     ` <9c33293a-0bf0-48e2-8a51-b18cf6dfd4b8-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
2018-06-26 13:41       ` Wei Wooi Peh
     [not found]         ` <CAAcdRmnLizCoq2Y8_x-M7nqb7R=D_WpHXAzdi1nz5S+vtXwY4g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2018-06-26 13:45           ` Paulo Ney de Souza
2018-06-26 14:03           ` Pablo Rodríguez
2018-06-28 12:36   ` Christophe Demko
     [not found]     ` <6ca16ea7-1a04-4c2e-8856-e75b5465d853-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
2018-06-28 12:51       ` Jeremy Theler
     [not found]         ` <e0252af0901d3a6a413f35a4f5c008ad8a13c41d.camel-24em0bpozeFWk0Htik3J/w@public.gmane.org>
2018-06-28 13:51           ` Shawn H Corey
2018-06-29 15:14           ` Christophe Demko
2018-07-03 12:08   ` CR
     [not found]     ` <cf83dc69-7e1c-44e2-92d0-003de00c4a96-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
2018-07-03 21:27       ` BP Jonsson
