On Sunday, January 29, 2017 at 8:38:34 AM UTC-5, BP Jonsson wrote:
>
> The problem with doing language tagging with a filter in the current AST
> model is that you do want to include inter-word spaces in the span but
> exclude some punctuation, like brackets, while including other punctuation
> like dashes and quotes. Terminal punctuation (period, comma, (semi)colon,
> exclamation/question mark) is very difficult: it should be included if the
> whole sentence or paragraph is Greek but excluded if only one or two words
> in an otherwise non-Greek sentence are Greek. I have an ugly hack of a Perl
> script which currently does a half-decent job of tagging parts of a text
> file based on Unicode scripts with a monster regular expression. It can
> even be told to skip code (inline and block), LaTeX environments and
> commands, and HTML tags, and spurred by this thread I spent some time
> yesterday trying to improve it by skipping the parenthesized part of inline
> links, but the results of running this script still need manual checking, so
> it works almost as well, or better, to just search for character spans in
> certain Unicode ranges in a capable editor.
>

Exactly. I was wondering whether using a regex to grab inline Greek that is flanked by either quotation marks or spaces would work. (That's obviously oversimplified.) All-Greek block quotes would also be included.
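
To make the regex idea concrete, here is a minimal sketch in Python (not the Perl script mentioned above, which I haven't seen). It assumes "Greek" means the Greek and Coptic block (U+0370-U+03FF) plus Greek Extended (U+1F00-U+1FFF), and that the output should be pandoc bracketed spans with a `lang=el` attribute. The names `GREEK_WORD`, `GREEK_RUN` and `tag_greek` are made up for illustration. It deliberately leaves all ASCII punctuation, including quotation marks and terminal punctuation, outside the span, so it does not address the whole-sentence case, and it does not skip code, LaTeX, HTML or link targets.

```python
import re

# Assumption: a Greek word starts with a character from Greek and Coptic
# (U+0370-U+03FF) or Greek Extended (U+1F00-U+1FFF). Combining diacritics
# (U+0300-U+036F) are allowed after the first character so that decomposed
# polytonic text is not split in the middle of a word.
GREEK_WORD = r"[\u0370-\u03FF\u1F00-\u1FFF][\u0300-\u036F\u0370-\u03FF\u1F00-\u1FFF]*"

# A run is one or more Greek words joined by single spaces, so inter-word
# spaces end up inside the span while surrounding punctuation stays outside.
GREEK_RUN = re.compile(rf"{GREEK_WORD}(?: {GREEK_WORD})*")


def tag_greek(text, open_tag="[", close_tag="]{lang=el}"):
    """Wrap every run of Greek words in text with open_tag and close_tag."""
    return GREEK_RUN.sub(lambda m: f"{open_tag}{m.group(0)}{close_tag}", text)


if __name__ == "__main__":
    print(tag_greek("He said ἐν ἀρχῇ, then stopped."))
```

Because the Greek blocks themselves contain a few punctuation characters (the Greek question mark U+037E, for instance), those would be swept into a span by this pattern; the output would still need the kind of manual checking described above.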