public inbox archive for pandoc-discuss@googlegroups.com
From: BP Jonsson <bpjonsson-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org
Subject: Re: Ancient Greek in output PDF
Date: Sun, 29 Jan 2017 14:38:32 +0100
Message-ID: <CAFC_yuQLyx_dUoMkY32OXT=-CmX9US5veLZdo87mvoD-OaUfzA@mail.gmail.com>
In-Reply-To: <e2f300d1-b5dd-41df-b81f-11d05516bc00-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>

The problem with doing language tagging with a filter in the current AST
model is that you want to include inter-word spaces in the span but
exclude some punctuation, like brackets, while including other punctuation,
like dashes and quotes. Terminal punctuation (period, comma, (semi)colon,
exclamation/question mark) is very difficult: it should be included if the
whole sentence or paragraph is Greek but excluded if only one or two words
in an otherwise non-Greek sentence are Greek.

I have an ugly hack of a Perl script which currently does a half-decent job
of tagging parts of a text file based on Unicode scripts, using a monster
regular expression. It can even be told to skip code (inline and block),
LaTeX environments and commands, and HTML tags. Spurred by this thread, I
spent some time yesterday trying to improve it by making it skip the
parenthesized part of inline links. The results of running the script still
need manual checking, though, so it works about as well, or better, to just
search for character spans in certain Unicode ranges in a capable editor.
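
To make the span rules above concrete, here is a rough Python sketch. It is
not the Perl script mentioned above, only an illustration under several
assumptions: the Greek ranges are hard-coded (Greek and Coptic plus Greek
Extended), the set of "joiner" characters is a guess, the function name
tag_greek is made up, and the output uses pandoc's bracketed-span syntax
with lang=grc. It handles one sentence at a time and does not skip code,
LaTeX or links.

```python
import re

# Assumption: "Greek" means the Greek and Coptic block plus Greek Extended.
GREEK = r"\u0370-\u03FF\u1F00-\u1FFF"
# Combining diacritics may follow a Greek letter (decomposed polytonic text).
MARKS = r"\u0300-\u036F"
WORD = rf"[{GREEK}][{GREEK}{MARKS}]*"
# Characters allowed *between* Greek words inside one span:
# space, hyphen, en/em dash, apostrophe (brackets and commas are not).
JOINER = r"[ \-\u2013\u2014'\u2019]+"
RUN = re.compile(rf"{WORD}(?:{JOINER}{WORD})*")
# Any letter that is not Greek (e.g. Latin or Cyrillic).
OTHER_LETTER = re.compile(rf"[^\W\d_{GREEK}{MARKS}]")


def tag_greek(sentence, lang="grc"):
    """Wrap Greek text in pandoc bracketed spans, e.g. [λόγος]{lang=grc}."""
    if RUN.search(sentence) and not OTHER_LETTER.search(sentence):
        # The whole sentence is Greek: terminal punctuation goes inside.
        return f"[{sentence}]{{lang={lang}}}"
    # Mixed sentence: tag only the Greek runs. Brackets and terminal
    # punctuation stay outside because RUN never matches them.
    return RUN.sub(lambda m: f"[{m.group(0)}]{{lang={lang}}}", sentence)


if __name__ == "__main__":
    print(tag_greek("The word λόγος (or ὁ λόγος) is common."))
    print(tag_greek("ἐν ἀρχῇ ἦν ὁ λόγος."))
```

Even this toy version shows why the output needs manual checking: a comma
between two Greek words in a mixed sentence splits the span in two, and
nothing here can tell a Greek sentence from a Greek quotation that merely
fills a whole line.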

For languages written in the same script, even more sophisticated language
detection software fails easily. For example, with Swedish and Icelandic,
more than half of the words in running text could be either language, but
they need different hyphenation rules.

/bpj

On 28 Jan 2017 at 20:29, "Andrew Dunning" <andunning-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:

> It works fine if you set your main font to something including Greek, e.g.
> Brill. All you're missing in that case is correct hyphenation.
> Alternatively, you can set a font by Unicode range rather than language (it
> doesn't do this automatically).
>
> It strikes me that what we really need is a filter that would tag
> languages in Pandoc output based on best guesses (Word does this to some
> extent already). Should theoretically be possible with the language
> detection libraries out there.
>
> --
> You received this message because you are subscribed to the Google Groups
> "pandoc-discuss" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
> To post to this group, send email to pandoc-discuss-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
> To view this discussion on the web visit https://groups.google.com/d/
> msgid/pandoc-discuss/e2f300d1-b5dd-41df-b81f-11d05516bc00%
> 40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
>

Thread overview: 18+ messages
2017-01-19 16:08 John Muccigrosso
2017-01-19 19:14   ` John MacFarlane
2017-01-20  4:33       ` John Muccigrosso
2017-01-20  9:04           ` John MacFarlane
2017-01-24 17:36   ` Pablo Rodríguez
2017-01-24 18:53       ` John Muccigrosso
2017-01-25  8:41           ` Scot Mcphee
2017-01-24 19:15       ` John Muccigrosso
2017-01-24 19:49           ` Pablo Rodríguez
2017-01-25 17:46           ` BP Jonsson
2017-01-26  0:50               ` John Muccigrosso
2017-01-26 13:32                   ` Melroch
2017-01-27 17:02                   ` Pablo Rodríguez
2017-01-28 18:52                       ` John Muccigrosso
2017-01-28 19:29                           ` Andrew Dunning
2017-01-29 13:38                               ` BP Jonsson [this message]
2017-01-29 16:53                                   ` John Muccigrosso
2017-01-29 16:51                               ` John Muccigrosso
