Pandoc, XeLaTex, and Hebrew Characters

public inbox archive for pandoc-discuss@googlegroups.com

* Pandoc, XeLaTex, and Hebrew Characters
From: Alan Storm @ 2017-06-17 16:35 UTC
  To: pandoc-discuss

New to the group, and only (so far!) the most casual of pandoc users. I'm
having some trouble getting pandoc to do things that the xelatex command
can do.  I'm looking for someone (or someones) who can help get me to a
solution **and** explain what pandoc is doing behind the scenes so I can
debug these sorts of problems in the future.

In short -- I have a TeX document with some Hebrew characters.  If I use
the xelatex command directly to convert this document, the Hebrew
characters are rendered correctly in the PDF:

    $ xelatex simple.tex 

However, if I attempt to do the conversion with pandoc using the
xelatex engine:

    $ pandoc --latex-engine=xelatex simple.tex -o from-pandoc.pdf

pandoc will render the PDF **without** the Hebrew characters.  There's just
blank white space where the Hebrew should be.

So, first question if anyone knows: How do I make pandoc render the Hebrew
into a PDF?

If there's no clear path to that answer, how do I debug the
pandoc/xelatex interaction in order to understand why pandoc's use of the
engine produces different results, and how might I change that
invocation?  Regarding this -- my understanding of pandoc is very limited,
so baby words/steps that let me become less of a baby are appreciated :)

Also, I have posted some specific examples, with Hebrew text, over on the
TeX StackExchange:

https://tex.stackexchange.com/questions/375380/getting-hebrew-support-working-with-pandoc/375443

https://tex.stackexchange.com/questions/375380/getting-hebrew-support-working-with-pandoc

Finally -- my ultimate goal here is to convert HTML documents with Hebrew
into PDFs.  As that also doesn't work, I'm focused on understanding the
[tex] to [pdf] conversion first.  If there's some extra wrinkle involved
with [html] to [pdf] (which I presume -- correctly? -- involves xelatex in
the middle) feel free to chime in there as well.

-- 
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To post to this group, send email to pandoc-discuss-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/3d7a8fd7-346d-4c74-ad52-e137c9509719%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


* Re: Pandoc, XeLaTex, and Hebrew Characters
From: Joost Kremers @ 2017-06-17 18:55 UTC
  To: pandoc-discuss

<tl;dr>

* Don't use Pandoc to convert .tex files to pdf, use latex/xelatex 
  directly.

* To inspect the LaTeX file that Pandoc sends to xelatex when
  producing a PDF from HTML, write the output to a .tex file:

     $ pandoc --latex-engine=xelatex simple.html -o from-pandoc.tex

  Then look at the resulting .tex file, or run it through xelatex
  yourself and check the logs.

</tl;dr>

On Sat, Jun 17 2017, Alan Storm wrote:
> New to the group, and only (so far!) the most casual of pandoc users.
> I'm having some trouble getting pandoc to do things that the xelatex
> command can do.  I'm looking for someone (or someones) who can help
> get me to a solution **and** explain what pandoc is doing behind the
> scenes so I can debug these sorts of problems in the future.

Pandoc reads a document in a supported input format, converts it 
to its own internal representation, then converts this 
representation to the desired output format.

If you wish to create a pdf, pandoc creates a LaTeX file (or, 
alternatively, a ConTeXt or html file) and runs it through 
pdflatex / xelatex (or context / wkhtmltopdf).

The internal representation is geared toward Markdown, which means 
that other input formats are not necessarily supported 100%, 
especially input formats that are much more expressive than 
Markdown (such as LaTeX). This in turn means that if your input 
format is anything other than Markdown, you may lose things in the 
conversion.
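
One way to see that loss directly is to ask pandoc to print its internal representation. The following is only a sketch: the file contents and the font name DejaVu Sans are assumptions made for illustration, not the simple.tex from the original question. The -f latex and -t native options are regular pandoc options.

```shell
# Hypothetical stand-in for simple.tex; the font name is an assumption.
cat > simple.tex <<'EOF'
\documentclass{article}
\usepackage{fontspec}
\setmainfont{DejaVu Sans}
\begin{document}
Hello שלום
\end{document}
EOF

if command -v pandoc >/dev/null 2>&1; then
  # Print pandoc's internal representation (AST) of the LaTeX input.
  pandoc -f latex -t native simple.tex
fi
```

The native output should contain the paragraph text, while preamble commands such as \setmainfont have no counterpart in it. That is why a font chosen in the .tex source does not carry over when pandoc writes its own LaTeX.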

> In short -- I have a TeX document with some Hebrew characters.  If I
> use the xelatex command directly to convert this document, the Hebrew
> characters are rendered correctly in the PDF
>
>     $ xelatex simple.tex
>
> However, if I attempt to do the conversion with pandoc using the
> xelatex engine
>
>     $ pandoc --latex-engine=xelatex simple.tex -o from-pandoc.pdf

If your input document is LaTeX and your target format is pdf, 
then don't use Pandoc. Use latex / xelatex instead. Due to its 
design, using Pandoc for this will usually not give you the result 
you're hoping for.

If you want to convert your document to multiple output formats, 
then your best bet is probably not to write it in LaTeX.

> So, first question if anyone knows: How do I make pandoc render the
> Hebrew into a PDF?
>
> If there's no clear path to that answer,

It really depends on what your input format is. As I said, if it 
is LaTeX, the best way is simply not to use Pandoc at all. ;-)

> how do I debug the pandoc/xelatex interaction in order to understand
> why pandoc's use of the engine produces different results, and how I
> might be able to change that invocation.  Regarding this -- my
> understanding of pandoc is very limited, so baby words/steps that let
> me become less of a baby are appreciated :)

As explained, Pandoc first converts the input document into its 
own internal representation. With LaTeX input, this conversion is 
usually lossy. In order to create a pdf (through LaTeX), Pandoc 
first converts its internal representation into a LaTeX document, 
using its default LaTeX template. This .tex file is then converted 
to pdf using pdflatex, xelatex or lualatex.

There are two places where you can influence this process: you can 
specify a different LaTeX template than the default one, and you 
can run filters on the internal representation. Getting enough 
info into the internal representation in order to be able to run 
your filter(s) to get the desired output may require some tweaking 
of the input document. (For example, Pandoc's internal 
representation has a concept of divs similar to html divs, which 
you could use for such purposes.)

If all of that sounds very vague, then that's because it is. :-) 
The details really depend on what you're trying to do.
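
To make the template half of that a little more concrete, here is a minimal sketch. The file name my-template.latex is hypothetical; -D and --template are regular pandoc options.

```shell
template="my-template.latex"   # hypothetical file name

if command -v pandoc >/dev/null 2>&1; then
  # -D prints pandoc's default template for the given output format.
  pandoc -D latex > "$template"
  # After editing the copy, pass it back with --template, for example:
  #   pandoc -s --template="$template" input.html -o output.tex
  # Count the lines in the default template that mention fonts.
  grep -c 'font' "$template" || true
fi
```

Reading the default template is also a reasonable way to find out which variables (such as mainfont) the LaTeX output responds to.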

> Finally -- my ultimate goal here is to convert HTML documents with
> Hebrew into PDFs.  As that also doesn't work, I'm focused on
> understanding the [tex] to [pdf] conversion.  If there's some extra
> wrinkle involved with [html] to [pdf] ( which I presume -- (correctly?)
> -- involves xelatex in the middle) feel free to chime in there as well.

If your html-to-pdf conversion uses LaTeX (Pandoc also supports 
pdf creation using wkhtmltopdf), you can debug the conversion by 
specifying a .tex file as the output file and looking at the LaTeX 
code produced. You can run it through XeLaTeX manually and inspect 
the log if the code itself doesn't tell you what the problem is.

HTH

-- 
Joost Kremers
Life has its moments


* Re: Pandoc, XeLaTex, and Hebrew Characters
From: John MacFarlane @ 2017-06-17 22:04 UTC
  To: pandoc-discuss

+++ Joost Kremers [Jun 17 17 20:55 ]:
>
><tl;dr>
>
>* Don't use Pandoc to convert .tex files to pdf, use latex/xelatex  
>directly.
>
>* In order to inspect what the LaTeX file looks like that Pandoc  
>sends to xelatex when producing pdf from html, use:
>
>
>    $ pandoc --latex-engine=xelatex simple.html -o from-pandoc.tex
>
>And look at the resulting .tex file, or run it through xelatex 
>yourself and check the logs.

This is all good advice, but just a small detail: use the -s
option, thus:

    $ pandoc -s --latex-engine=xelatex simple.html -o from-pandoc.tex

This will give you a standalone file you can run xelatex on.
Otherwise you'll get a fragment without a header.

Also:  you can use --verbose with pandoc and it will print
the intermediate latex file it is using to create a PDF,
as well as other information that may help you in debugging.

On the specific issue of fonts, have you set the 'mainfont'
variable to a font with Hebrew characters?
(See the manual.)
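
A small sketch of the difference -s makes, assuming a throwaway input file (the contents of simple.html here are made up). The engine option is left out because no PDF is produced at this stage.

```shell
# Hypothetical input; any HTML file containing Hebrew will do.
printf '<p>Hello שלום</p>\n' > simple.html

if command -v pandoc >/dev/null 2>&1; then
  pandoc simple.html -o fragment.tex        # body only, no preamble
  pandoc -s simple.html -o standalone.tex   # complete document
  # Per-file count of lines mentioning \documentclass.
  grep -c 'documentclass' fragment.tex standalone.tex || true
fi
```

Only the standalone file should mention \documentclass, and that is the one you can pass to xelatex by hand to read the log.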


* Pandoc, XeLaTex, and Hebrew Characters
From: Kolen Cheung @ 2017-06-19  5:17 UTC
  To: pandoc-discuss

Another problem you may hit is that Hebrew is written right to left, and getting that right is non-trivial. The tricky part is telling (Xe)LaTeX that a given stretch of text runs right to left.

If your whole document is in Hebrew, you can probably configure the whole document to be right to left. If not, you need to enclose each use of Hebrew in a command that specifies its language.

I tried using ucharclasses in LaTeX to auto-detect the Hebrew boundaries, but the space between Hebrew words breaks the boundary, so each Hebrew word is set right to left while the sentence as a whole still runs left to right.

To specify the language of a region, I think pandoc has a native way to do it (see the manual for the lang attribute), and I guess it will translate into the proper LaTeX command (you should test, though).
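
As a rough sketch of what that could look like in pandoc's Markdown, assuming a pandoc version that supports bracketed spans: the file name and sentence below are made up, and what the LaTeX writer emits for lang and dir differs between pandoc versions, so treat this as something to test rather than a guaranteed recipe.

```shell
# Hypothetical mixed-language Markdown; the span attributes mark the Hebrew run.
cat > mixed.md <<'EOF'
English text, then [שלום עולם]{lang=he dir=rtl}, then English again.
EOF

if command -v pandoc >/dev/null 2>&1; then
  pandoc -s mixed.md -o mixed.tex
  # Show the line(s) of generated LaTeX that carry the Hebrew text.
  grep -n 'שלום' mixed.tex || true
fi
```

If the Hebrew run comes out wrapped in a language or direction command, the attribute is being honoured; if it appears as bare text, the markup did not reach the LaTeX output.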


* Re: Pandoc, XeLaTex, and Hebrew Characters
From: Alan Storm @ 2017-06-20 22:21 UTC
  To: pandoc-discuss

Thanks all for the help.  To close the loop on this for anyone coming
along later on:

1. Despite hearing it from several people, it took me a while to realize
that pandoc, even when converting from tex to PDF, will still perform its
"munge this doc into an internal representation" sub-routines.  This means
most of what you're doing in LaTeX gets lost, unless pandoc happens to
understand it.

2. Pandoc's default font doesn't support Hebrew:
https://github.com/jgm/pandoc/issues/3742

3. You can specify a font that does support Hebrew via `-V mainfont:` :
https://tex.stackexchange.com/a/375544/4689

4. If you're dealing with a document that has mixed English/Hebrew you may
be out of luck with pure pandoc, as you need to mark up which sections
contain Hebrew and which contain English:
https://tex.stackexchange.com/a/375498/4689
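
For anyone who wants points 2 and 3 as commands, a minimal sketch follows. The input file and the font name DejaVu Sans are assumptions; substitute any installed font that has Hebrew glyphs. Note that pandoc 2.0 and later renamed --latex-engine to --pdf-engine, so the sketch tries the newer name first.

```shell
# Hypothetical input file.
printf '<p>שלום עולם</p>\n' > hebrew.html

# Assumption: this font is installed and has Hebrew glyphs.
font="DejaVu Sans"

# If fontconfig is available, list installed families that claim Hebrew coverage.
if command -v fc-list >/dev/null 2>&1; then
  fc-list :lang=he family | sort -u
fi

if command -v pandoc >/dev/null 2>&1 && command -v xelatex >/dev/null 2>&1; then
  # Newer pandoc uses --pdf-engine; fall back to the flag used in this thread.
  pandoc --pdf-engine=xelatex -V mainfont="$font" hebrew.html -o hebrew.pdf ||
    pandoc --latex-engine=xelatex -V mainfont="$font" hebrew.html -o hebrew.pdf ||
    echo "conversion failed; check that the font named in \$font is installed"
fi
```

This only addresses the missing glyphs. Right-to-left ordering in mixed documents (point 4) is a separate problem.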

Thanks again for the help, and best wishes!



* Re: Pandoc, XeLaTex, and Hebrew Characters
From: Kolen Cheung @ 2017-06-22  0:01 UTC
  To: pandoc-discuss

On Tuesday, June 20, 2017 at 3:21:45 PM UTC-7, Alan Storm wrote:
>
> Thanks all for the help.  To close the loop on this for anyone coming
> along later on:
>
> 1. Despite hearing it from several people, it took me a while to realize
> that pandoc, even when converting from tex to PDF, will still perform its
> "munge this doc into an internal representation" sub-routines.  This means
> most of what you're doing in LaTeX gets lost, unless pandoc happens to
> understand it.
>

This should be well known. Who claimed otherwise?

I think there was another issue/discussion on this before. Basically one 
cannot expect pandoc to act like latexmk for example.
 

> 2. Pandoc's default font doesn't support Hebrew:
> https://github.com/jgm/pandoc/issues/3742

This should be expected. In LaTeX, for Unicode (multi-language) sources,
one has to be careful about font choice, and about the transitions
between fonts as well.

> 3. You can specify a font that does support Hebrew via `-V mainfont:` :
> https://tex.stackexchange.com/a/375544/4689
>
> 4. If you're dealing with a document that has mixed English/Hebrew you may
> be out of luck with pure pandoc, as you need to mark up which sections
> contain Hebrew and which contain English:
> https://tex.stackexchange.com/a/375498/4689

Depends on what you mean by "pure pandoc". In pandoc's markdown syntax,
you can natively specify this with the lang attribute (which again is a
native pandoc extension, not just some HTML syntax).
 

