I reckon you can handle most cases with just the direction-breaking character, and I've got a custom keymap with a shortcut for typing it. Because so many apps now implement the bidi algorithm, I can see in my text editor whether it's working correctly there, with some confidence that the same output will occur elsewhere (if the copy/paste preserves the Unicode characters).

But yes, it's all much easier than in the past - thankfully I never had to do what you did back then!

On Friday, June 9, 2017 at 12:13:00 PM UTC+1, BP wrote:
On occasion I have used some of the many Unicode bracket pairs as delimiters and replaced them with the corresponding LaTeX or HTML markup by regular expression or by making them active in XeLaTeX. They're minimally invasive but can be a pain to type, and it taxes the memory to remember which is which. I would end up defining a code snippet where I would type something like `..grc TEXT<tab>` anyway, and then I think I can just as well have the same snippet expand to a span. It's so much less invasive now that Pandoc supports bracketed spans. As always, how you see things depends on where you are coming from. Back in the early nineties I had to make my own 8-bit fonts and do the necessary incantations to make them work with LaTeX, and I wrote my own script which sorted and marked up word indexes, a must-have in comparative philology. In fact I still use a descendant of it; things are both easier and harder with Unicode!


On 9 Jun 2017 at 10:03, "Lyndon Drake" <lyn...@arotau.com> wrote:
I did once know Haskell, but these days it makes my head hurt a bit. I'll have a go later this summer once exams are out of the way. I think it's feasible (partly because I've seen it implemented elsewhere) but I could easily be wrong, in which case I'll fall back on the marked spans. Your filter does at least make them minimally invasive. I just don't like the way it clutters up the text-only view with markup, which is the number one reason I'm trying to use Markdown instead of just using LaTeX directly.

Thanks again for the help.

On Thursday, June 8, 2017 at 6:20:11 PM UTC+1, BP wrote:
The marked spans approach suits me well as in my nook of the woods everything but Greek and modern Cyrillic languages is usually cited in romanization.

Setting the language based on Unicode ranges with a Pandoc filter is probably rather hard, since Pandoc splits text content into lists of alternating Str elements, containing the non-whitespace parts, and Space, SoftBreak and LineBreak elements representing the whitespace parts. You will need to locate all places in the AST where there is such a list, step through the list looking for Str elements containing characters from one of the scripts you are interested in, and enclose sequences of alternating script-character Str elements and whitespace elements in Span elements with appropriate attributes, or in raw markup elements. It is possible that this might be done more or less efficiently with Haskell; I don't know. If, like me, you don't know Haskell, it is going to be tricky. If you also want the resulting markup to look pretty you need to take such things as embedded emphasis elements into consideration, not to speak of Str elements with mixed scripts, if any. I tried a poor man's substitute once, locating Greek portions in Markdown source with a regular expression, but it was too hard to handle punctuation sufficiently elegantly (Greek portions starting and ending with punctuation in particular).
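
To make the stepping-through concrete, here is a minimal sketch in Python of the core step only: wrapping runs of same-script Str elements (plus the Space elements between them) in a Span. It works on the plain JSON form of a Pandoc inline list rather than on any filter library, so the names `SCRIPTS`, `script_of`, `make_span` and `wrap_runs` are all hypothetical, as is the choice of `he` and `syr` as language tags. It does not descend into emphasis or other nested inlines, and a Str with mixed scripts is simply assigned to the first script that matches, which are exactly the hard cases mentioned above.

```python
# Sketch only: operates on Pandoc's JSON representation of one inline list,
# e.g. {"t": "Str", "c": "word"} and {"t": "Space"}.

# Hypothetical configuration: language tag -> ((first, last) code point, dir).
SCRIPTS = {
    "he": ((0x0590, 0x05FF), "rtl"),   # Unicode Hebrew block
    "syr": ((0x0700, 0x074F), "rtl"),  # Unicode Syriac block
}


def script_of(text):
    """Return the first language tag whose range contains a character of text."""
    for lang, ((lo, hi), _direction) in SCRIPTS.items():
        if any(lo <= ord(ch) <= hi for ch in text):
            return lang
    return None


def make_span(lang, inlines):
    """Build a Span element carrying lang and dir attributes."""
    attr = ["", [], [["lang", lang], ["dir", SCRIPTS[lang][1]]]]
    return {"t": "Span", "c": [attr, inlines]}


def wrap_runs(inlines):
    """Wrap runs of same-script Str elements and the Spaces between them."""
    out = []
    run, run_lang, pending = [], None, []

    def flush():
        # Emit the current run as a Span; trailing Spaces stay outside it.
        nonlocal run, run_lang, pending
        if run:
            out.append(make_span(run_lang, run))
        out.extend(pending)
        run, run_lang, pending = [], None, []

    for el in inlines:
        lang = script_of(el["c"]) if el["t"] == "Str" else None
        if lang is not None:
            if run and lang == run_lang:
                run.extend(pending)  # Spaces inside the run belong to it
                pending = []
                run.append(el)
            else:
                flush()
                run, run_lang = [el], lang
        elif el["t"] == "Space" and run:
            pending.append(el)  # decide later whether this Space is inside
        else:
            flush()
            out.append(el)
    flush()
    return out
```

A real filter would still have to find every inline list in the document (paragraphs, headers, table cells, emphasis and so on) and call something like `wrap_runs` on each, and punctuation attached to a Str is wrapped along with it.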

You mentioned the ucharclasses package, which may be helpful when producing PDF. My experience is again that punctuation inside an other-script portion is problematic, at least if the other script uses punctuation from General Punctuation and/or ASCII, and it still leaves you out of luck when producing HTML. It is still best practice to wrap portions with other directionality in elements with a dir attribute, not to mention making sure that portions in other languages are put in spans or divs with a lang attribute, not least for the benefit of those using assistive technologies.

On 8 Jun 2017 at 12:14, "Lyndon Drake" <lyn...-S8RYeTzMgQ3QT0dZR+AlfA@public.gmane.org> wrote:
That's very helpful, thank you!

I haven't actually rewritten the LaTeX template preamble, just moved things around a bit, retaining all the Pandoc variables but making the package load order robust. I think I'll probably slightly update the template, for two more things:

1. it's nice to have access to two places for header insertions, one near the start and the other near the end of the header material once all the packages have been loaded;

2. the memoir class works best for books when there are \frontmatter, \mainmatter, and \backmatter calls in the document, so I will add options and a test to allow those.

I like the span syntax you've got, though in the medium term I want to work on an automatic filter for setting the language based on Unicode ranges and just use spans to cover unusual cases (although in fact, there is almost no case that cannot be covered by the use of Unicode direction-breaking marks or spaces, which have the advantage of working without any markup in HTML in modern browsers with font fallback mechanisms).


On Wednesday, June 7, 2017 at 5:23:54 PM UTC+1, BP Jonsson wrote:
On 2017-06-06 at 15:26, Lyndon Drake wrote:
> Thanks for this. I'd come to the conclusion that writing a latex file and
> including the fragments that pandoc generates might be the way forward, but
> I'm also curious to know what I've been doing wrong.

You don't need to write the whole preamble by hand, just the part
where you load and configure polyglossia and define the fonts
needed for polyglossia.
Put that part in a file called, for example, `poly.ltx` and then run
Pandoc with

````
pandoc -H poly.ltx --latex-engine=xelatex
````

I *think* this will also make the bidi bug go away. The
polyglossia package loads the bidi package if needed, but bidi
wants to be loaded after a lot of other packages which it performs
keyhole surgery on, including longtable and even hyperref. However
Pandoc's latex template loads polyglossia quite early,
as an alternative to loading babel. There may be no other way to fix
that than to use a custom template where polyglossia is loaded
quite late, perhaps even after the header-includes, lest the
latter also load some package which bidi wants to be loaded before
itself. I have made such a template
(<https://gist.github.com/bpj/5cebc975685134145cd74ca8670b1ccc>).
If it solves the problem please let me know and I'll make a pull
request for the change.

My custom template also includes my fontspec hack which lets you
declare font families in your metadata like this:

````
font-families:
   - name: '\<language>font'
     font:    <Font Name>
     options:
       - <key>='<value>'
   - name: '\greekfont'
     font:    GFS Neohellenic
     options:
       - Language=Greek
       - Script=Greek
       - Scale=MatchLowercase
       - Ligatures=TeX
   - name: '\sanskritfont'
     font:    Sahadeva
     options:
       - Language=Sanskrit
       - Script=Devanagari
   - name: '\myfancyfont'
     font: My Fancy
````
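
For orientation, the sketch below shows in Python the kind of fontspec declaration one would expect each `font-families` entry to become. This is an assumption about the output, not a description of how the template in the gist actually does it (a Pandoc template would do this with its own loop syntax); the helper name `newfontfamily` is hypothetical. The only thing taken as given is the standard fontspec form `\newfontfamily<cmd>[<options>]{<font>}`.

```python
def newfontfamily(entry):
    """Render one font-families entry as a fontspec \\newfontfamily line.

    `entry` is a dict with the keys used in the metadata example:
    'name' (the LaTeX command), 'font', and an optional 'options' list.
    """
    options = ",".join(entry.get("options", []))
    opt_part = f"[{options}]" if options else ""
    return f"\\newfontfamily{entry['name']}{opt_part}{{{entry['font']}}}"


# The Greek and "fancy" entries from the metadata example, as Python data.
families = [
    {
        "name": "\\greekfont",
        "font": "GFS Neohellenic",
        "options": ["Language=Greek", "Script=Greek",
                    "Scale=MatchLowercase", "Ligatures=TeX"],
    },
    {"name": "\\myfancyfont", "font": "My Fancy"},
]

for family in families:
    print(newfontfamily(family))
```

Lines of this shape are also what would go by hand into a `-H` file such as `poly.ltx` if the custom template is not used.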


>
> No rush of course, but I'm keen to have a look at your filter and see what
> it does,

It is now documented and uploaded:

<https://gist.github.com/bpj/02de1ed87ff8f8d0c31a43b9dcac1c80>

(Scroll down for the rendered documentation. The first code block
should suffice to understand how it works.)

It takes some initial configuration but that should be reusable by
including a separate YAML file on the command line with the actual
document.




> even without docs. I've also found another filter on the list back
> in 2014 from Jesse Rosenthal that looks at Unicode ranges and wraps them in
> a latex environment, which seems like a good idea (I've done this kind of
> thing in InDesign grep styles and it works well for most normal bits of
> text).
>
> I found the lang/otherlangs documentation, but couldn't figure out from the
> manual (might just be overlooking the correct bit) how to set a div or a
> span for another language.
>
> Part of the problem is that if I set lang and otherlangs as follows:
>
>    lang: en-GB
>    otherlangs: [he, sy]
>    
> I get this:
>
> ! Package bidi Error: Oops! you have loaded package longtable after bidi
> package. Please load package longtable before bidi package, and then try to
> run xelatex on your document again.
>
>
> See the bidi package documentation for explanation.
>
> Type  H <return>  for immediate help.
>
>   ...
>
>                                                    
>
> l.72 \begin{document}
>
>
> pandoc: Error producing PDF
>
>
> which I guess means that some kind of strange interaction in the latex
> template is producing an undesirable latex file to feed to xelatex (maybe
> pandoc-csv2table is doing something to the produced latex?). But it kind of
> put a stop to me experimenting with the spans and divs.
>
> Best,
> Lyndon
>
> On Tuesday, June 6, 2017 at 1:08:48 PM UTC+1, BPJ wrote:
>>
>> You need to use the lang and otherlangs variables as described in the
>> manual http://pandoc.org/MANUAL if I recall correctly.
>>
>> Alternatively/additionally write a latex file containing a preamble
>> fragment where you load polyglossia and any languages and fonts you need
>> with the options you need in the usual polyglossia/fontspec way and include
>> it with the -H option. You also need to mark spans/divs containing extra
>> languages with lang and dir attributes as appropriate. Use your browser's
>> page search function to find these terms in the manual.
>>
>> I saw your other question about font/language switching yesterday and
>> started to write some documentation for the filter I use to make those
>> things easier. Alas I couldn't finish and today there is a national holiday
>> in Sweden. I'll get back to it tomorrow. Basically you can use spans with a
>> single short class like .g for Greek and the filter will inject the latex
>> markup, docx custom style names or extended (html) attributes you have
>> declared to correspond to the class in your metadata.
>>
>> I can comfort you that you are much better off than I was when I started
>> doing multilingual work with Pandoc. We had no filters, no native spans or
>> divs and no built-in multilingual/polyglossia support back then. Everything
>> had to be done in -H files and with raw latex in the markdown, which was a
>> pain because I needed to make things available in HTML as well.
>>
>> I'll also update my latex template on github which contains some stuff for
>> fontspec font loading.
>>
>> I hope this helps. I'm afraid I won't be able to check my mail for the
>> rest of the day.
>>
>>
>> On Tue, 6 June 2017 at 09:02, Lyndon Drake <lyn...@arotau.com> wrote:
>>
>>> Sorry, I probably wasn't clear: I followed the instruction from Pandoc
>>> and switched to xelatex. Now I'm stuck trying to configure the language
>>> options.
>>>
>>>
>>> On Tuesday, June 6, 2017 at 7:41:32 AM UTC+1, BP wrote:
>>>
>>>> You need the --latex-engine=xelatex option.
>>>>
>>>> On Tue, 6 June 2017 at 07:53, Lyndon Drake <lyn...-S8RYeTzMgQ3QT0dZR+AlfA@public.gmane.org> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> Many apologies as I'm sure this is all obvious once one knows, but I'm
>>>>> a bit stuck. I've got some Pandoc Markdown files which I'm trying to
>>>>> convert to PDF using Pandoc. They include various non-ascii characters, all
>>>>> in unicode. If I run:
>>>>>
>>>>> /usr/local/bin/pandoc -f
>>>>> markdown+pipe_tables+grid_tables+yaml_metadata_block --filter
>>>>> pandoc-citeproc --filter pandoc-csv2table -s -o formatted/Draft3.pdf
>>>>> text/metadata.yaml text/1-Introduction.md
>>>>>
>>>>> I get the following:
>>>>>
>>>>> ! Package inputenc Error: Unicode char ṣ (U+1E63)
>>>>>
>>>>> (inputenc)                not set up for use with LaTeX.
>>>>>
>>>>>
>>>>> See the inputenc package documentation for explanation.
>>>>>
>>>>> Type  H <return>  for immediate help.
>>>>>
>>>>>   ...
>>>>>
>>>>>                                                    
>>>>>
>>>>> l.125   Vandenhoeck \& Ruprecht, 1990), 39--62.}
>>>>>
>>>>>
>>>>> Try running pandoc with --latex-engine=xelatex.
>>>>>
>>>>> pandoc: Error producing PDF
>>>>>
>>>>>
>>>>> So the next step was to switch to xelatex based on the helpful
>>>>> suggestion from pandoc. As long as I don't try to use any babel or
>>>>> polyglossia environments, or biblatex, this works fine. But as I want to
>>>>> use both, I'm a bit stuck. First thing is that it looks like the default
>>>>> template tries to use babel rather than polyglossia if xetex is the engine.
>>>>> Is there a reason for this? (I want to use the biblatex-sbl style for my
>>>>> bibliography, and they recommend polyglossia.)
>>>>>
>>>>> I want to use English (UK) as my main language, with Hebrew and Syriac
>>>>> as other languages (I've also got some ancient Greek, but the main font
>>>>> I've chosen works fine and the output looks good for that without using a
>>>>> separate language environment).
>>>>>
>>>>> As a starting point, what language options do I set in my YAML metadata
>>>>> to enable those other two language environments, and how do I specify the
>>>>> fonts for them?
>>>>>
>>>>> Here's my YAML metadata file so far:
>>>>>
>>>>> ---
>>>>>    author: Lyndon Drake
>>>>>    documentclass: memoir
>>>>>    toc: true
>>>>>    papersize: a4
>>>>>    fontsize: 12pt
>>>>>    top-level-division: chapter
>>>>>    number-sections: true
>>>>>    mainfont: Skolar PE Light
>>>>>    mainfontoptions: Numbers=OldStyle
>>>>>    bibliography: /Users/lyndon/Documents/Media/Bibliography/0lib.bib
>>>>>    csl: /Users/lyndon/Documents/Media/Bibliography/society-of-biblical-literature-fullnote-bibliography.csl
>>>>>    notes-after-punctuation: true
>>>>> ---
>>>>>
>>>>> Many thanks in advance for any help on this,
>>>>> Lyndon
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "pandoc-discuss" group.
>>>>>
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to pandoc-discus...@googlegroups.com.
>>>>> To post to this group, send email to pandoc-...@googlegroups.com.
>>>>
>>>>
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/pandoc-discuss/89122680-f883-4853-a97f-a81861395b78%40googlegroups.com
>>>>> <https://groups.google.com/d/msgid/pandoc-discuss/89122680-f883-4853-a97f-a81861395b78%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>> .
>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>