On Sunday, January 29, 2017 at 8:38:34 AM UTC-5, BP Jonsson wrote:
>
> The problem with doing language tagging with a filter in the current AST 
> model is that you do want to include inter-word spaces in the span but 
> exclude some punctuation, like brackets, while including other punctuation 
> like dashes and quotes. Terminal punctuation (period, comma, (semi)colon, 
> exclamation/question mark) is very difficult: it should be included if the 
> whole sentence or paragraph is Greek but excluded if only one or two words 
> in an otherwise non-Greek sentence are Greek. I have an ugly hack of a Perl 
> script which currently does a half decent job of tagging parts of a text 
> file based on Unicode scripts with a monster regular expression. It can 
> even be told to skip code (inline and block), LaTeX environments and 
> commands and HTML tags, and spurred by this thread I spent some time 
> yesterday trying to improve it by skipping the parenthesized part of inline 
> links, but the results of running this script still need manual checking, 
> so it works almost as well, or better, to just search for character spans 
> in certain Unicode ranges in a capable editor.
>

Exactly. I was wondering whether a regex that grabs in-line Greek flanked 
by either quotation marks or spaces would work. (That's obviously 
oversimplified.) All-Greek block quotes would also be included. 
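
To make the idea concrete, here is a minimal sketch in Python, not a tested tool. Everything in it is an assumption for illustration: the function name `tag_greek`, the `grc` language tag, the choice of the "Greek and Coptic" and "Greek Extended" blocks as the definition of "Greek", and the use of pandoc's bracketed-span syntax for the output. It only wraps runs of Greek words (with the spaces between them) that are flanked by whitespace, a quotation mark, or the start/end of the text. It does not look at block quotes.

```python
import re

# Assumption: these two Unicode blocks are a good-enough stand-in for "Greek".
# Combining accents (decomposed text) are not handled here.
GREEK = "\u0370-\u03ff\u1f00-\u1fff"
QUOTES = "\"'“”‘’«»"

GREEK_RUN = re.compile(
    rf"(?<![^\s{QUOTES}])"           # preceded by start of text, space, or quote
    rf"[{GREEK}]+(?: +[{GREEK}]+)*"  # Greek words joined by plain spaces
    rf"(?![^\s{QUOTES}])"            # followed by end of text, space, or quote
)

def tag_greek(text, lang="grc"):
    """Wrap each flanked Greek run in a bracketed span, e.g. [..]{lang=grc}."""
    return GREEK_RUN.sub(lambda m: f"[{m.group(0)}]{{lang={lang}}}", text)

if __name__ == "__main__":
    print(tag_greek('He said εν αρχη and then "λογος" again.'))
```

Because the flanking rule is taken literally, a Greek word followed directly by a period or comma is left untagged, which is exactly the terminal-punctuation case described above that still needs a decision (or a manual check).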
