public inbox archive for pandoc-discuss@googlegroups.com
* Copy-pasting code from the PDF loses formatting
@ 2022-01-06  9:10 Robert Fekete
From: Robert Fekete @ 2022-01-06  9:10 UTC (permalink / raw)
  To: pandoc-discuss


Hi Everyone, 

I'm trying to create PDF output from HTML input, and ran into a strange
problem:

Code samples (for example, YAML or Python) are properly formatted in the
PDF, but most of the formatting is lost when copy-pasting the code from the
PDF into a text editor or terminal. Depending on the PDF viewer, either:

   - line breaks are retained, but indentation is lost (Evince, Preview,
   Adobe Reader), or
   - line breaks are lost and everything becomes a single line, but
   whitespace is retained (the built-in PDF viewers of Firefox and VS Code)

I'm currently using pandoc 2.14.2 on macOS Big Sur.

I have attached two test files (input and output). I created the PDF with
the wkhtmltopdf engine, but I've tested other engines as well (xelatex,
weasyprint) and the results were similar.

Has anyone seen a similar problem? Any pointers are appreciated.

Kind Regards,
Robert

-- 
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/a976bf18-7019-43cf-84c2-0a2d375cef55n%40googlegroups.com.

[-- Attachment #2: test-2.html --]
[-- Type: text/html, Size: 2769 bytes --]

[-- Attachment #3: test-2.pdf --]
[-- Type: application/pdf, Size: 26120 bytes --]

* Re: Copy-pasting code from the PDF loses formatting
@ 2022-01-06 13:59   ` Leonard Rosenthol
From: Leonard Rosenthol @ 2022-01-06 13:59 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

Robert - the reason why none of the viewers are copying out indentation is
that there isn't actually any indentation there (i.e. no space or tab
characters); the text is simply "moved". Normally PDF viewers are able
to apply heuristics to "guess" when the amount of "movement" is supposed to
mean indentation - but this particular amount of "movement" is too small
for consideration. If you make the indent, say, 4 spaces' worth instead of 2,
I suspect you will get the result you wish.
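
The kind of heuristic described above can be sketched roughly as follows.
This is only an illustration under stated assumptions, not how any
particular viewer works: the TextRun type, the reconstruct_lines function
and the min_indent_spaces threshold are hypothetical names and values,
chosen to show why a small horizontal offset may be dropped while a larger
one is turned into spaces.

```python
from dataclasses import dataclass


@dataclass
class TextRun:
    """One positioned piece of text, as a PDF page might store it (hypothetical)."""
    x: float    # horizontal start position, in points
    y: float    # baseline position, in points (PDF y grows upward)
    text: str


def reconstruct_lines(runs, left_margin, space_width, min_indent_spaces=3.0):
    """Guess leading whitespace from horizontal offsets.

    The page contains no space or tab characters for the indent, only a
    shifted x position. Offsets narrower than min_indent_spaces (an assumed
    threshold) are treated as noise and produce no indentation at all.
    """
    lines = []
    # Sort top to bottom: a larger y is higher on the page.
    for run in sorted(runs, key=lambda r: -r.y):
        offset_in_spaces = (run.x - left_margin) / space_width
        if offset_in_spaces < min_indent_spaces:
            indent = 0                      # too small: the "movement" is ignored
        else:
            indent = round(offset_in_spaces)
        lines.append(" " * indent + run.text)
    return lines


if __name__ == "__main__":
    # With a 6pt space width, a 12pt shift is 2 spaces' worth, a 24pt shift is 4.
    narrow = [TextRun(72, 700, "key:"), TextRun(84, 688, "child: 1")]
    wide = [TextRun(72, 700, "key:"), TextRun(96, 688, "child: 1")]
    print(reconstruct_lines(narrow, left_margin=72, space_width=6))
    print(reconstruct_lines(wide, left_margin=72, space_width=6))
```

With the assumed threshold of 3 spaces, the 2-space shift is pasted without
indentation, while the 4-space shift comes out with four leading spaces.
Real viewers use their own rules and thresholds, so results can differ.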

* Re: Copy-pasting code from the PDF loses formatting
@ 2022-01-06 14:50     ` Robert Fekete
From: Robert Fekete @ 2022-01-06 14:50 UTC (permalink / raw)
  To: pandoc-discuss


Hi Leonard,

Thanks a lot for the tip. Unfortunately, it doesn't seem to solve the
problem, but I'll play with it some more. Is there any way to force this,
maybe from the HTML side, for example by replacing spaces with tabs? (Sorry
if this doesn't make sense; I don't know much about the inner workings of
the PDF format.)
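
One experiment along the lines of the question above is to preprocess the
HTML so that leading spaces inside <pre> blocks become non-breaking spaces
(U+00A0) before the file is handed to pandoc. The sketch below is an
untested assumption, not a confirmed fix: whether a given PDF engine then
writes those characters out as real glyphs has to be checked per engine,
and the harden_indentation name is made up for this example. The regex is
deliberately naive and assumes each <pre> block is simple, well-formed
markup.

```python
import re

NBSP = "\u00a0"

# Match each <pre ...>...</pre> block, keeping the tags as separate groups.
PRE_BLOCK = re.compile(r"(<pre\b[^>]*>)(.*?)(</pre>)", re.DOTALL | re.IGNORECASE)

# Runs of ordinary spaces at the start of a line.
LEADING_SPACES = re.compile(r"^ +", re.MULTILINE)


def harden_indentation(html: str) -> str:
    """Replace leading spaces inside <pre> blocks with non-breaking spaces.

    Text outside <pre> blocks is left untouched.
    """
    def fix_block(match: re.Match) -> str:
        body = LEADING_SPACES.sub(lambda s: NBSP * len(s.group(0)), match.group(2))
        return match.group(1) + body + match.group(3)

    return PRE_BLOCK.sub(fix_block, html)


if __name__ == "__main__":
    sample = "<p>  text</p>\n<pre><code>key:\n  child: 1\n</code></pre>"
    print(repr(harden_indentation(sample)))
```

Note that text copied from such a PDF would contain U+00A0 rather than
ordinary spaces, which YAML and Python do not accept as indentation, so the
pasted code would still need the characters converted back.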
