public inbox archive for pandoc-discuss@googlegroups.com
From: "ph...-jvooPmwWovGx7wk7FZxnQA@public.gmane.org" <phil-jvooPmwWovGx7wk7FZxnQA@public.gmane.org>
To: pandoc-discuss <pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
Subject: Re: Syntax highlighting for English as a natural language
Date: Tue, 8 Dec 2020 02:45:10 -0800 (PST)
Message-ID: <c81b271c-5fa5-4e97-bf9b-a111efcc26d6n@googlegroups.com>
In-Reply-To: <CADAJKhAK=EWS0gTgqL=WpXTpWEK-AAKKASfxwgU4MSLgXtU=ig-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

You could write a pandoc filter in Python and import NLTK (the Natural 
Language Toolkit) into it: NLTK has functions for tagging parts of speech. 
But, as BP points out, this will be fairly complex, and the tagging is not 
guaranteed to be 100% correct.
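
For the small, hand-coded case described in the original question, a filter 
may not need NLTK at all. The following is only a minimal sketch under some 
assumptions, not a tested solution: the names LEXICON, COLOURS, colourise, 
transform and main are made up for illustration, the word list and colours 
are placeholders, and only top-level code blocks with the class `english` 
are handled. It relies on pandoc handing a JSON filter the document AST on 
stdin and the target format as the first argument. On the LaTeX side, 
\textcolor needs the xcolor package in the preamble.

```python
#!/usr/bin/env python3
"""Sketch: colour hand-listed words in `english` code blocks (HTML or LaTeX)."""
import html
import json
import re
import sys

# Hypothetical hand-coded lexicon: lower-cased word -> word class.
LEXICON = {"i": "pronoun", "use": "verb", "pandoc": "noun"}

# Hypothetical colour per word class; these names exist in both CSS and xcolor.
COLOURS = {"pronoun": "blue", "verb": "red", "noun": "teal"}

# A token is either a run of letters/apostrophes (a word) or a run of anything else.
TOKEN = re.compile(r"[A-Za-z']+|[^A-Za-z']+")

LATEX_SPECIALS = {
    "\\": r"\textbackslash{}", "&": r"\&", "%": r"\%", "$": r"\$", "#": r"\#",
    "_": r"\_", "{": r"\{", "}": r"\}", "~": r"\textasciitilde{}",
    "^": r"\textasciicircum{}",
}


def escape(text, fmt):
    """Escape characters that are special in the chosen raw format."""
    if fmt == "latex":
        return "".join(LATEX_SPECIALS.get(ch, ch) for ch in text)
    return html.escape(text)


def colourise(text, fmt):
    """Wrap every word found in LEXICON in a colour command for `fmt`."""
    out = []
    for token in TOKEN.findall(text):
        colour = COLOURS.get(LEXICON.get(token.lower()))
        safe = escape(token, fmt)
        if colour is None:
            out.append(safe)  # unknown word, space or punctuation: no colour
        elif fmt == "latex":
            out.append(r"\textcolor{%s}{%s}" % (colour, safe))
        else:
            out.append('<span style="color: %s">%s</span>' % (colour, safe))
    return "".join(out)


def transform(doc, fmt):
    """Replace top-level CodeBlocks of class `english` with raw coloured text."""
    raw_fmt = "latex" if fmt in ("latex", "beamer") else "html"
    for i, block in enumerate(doc["blocks"]):
        # A CodeBlock is {"t": "CodeBlock", "c": [[id, classes, attrs], text]}.
        if block["t"] == "CodeBlock" and "english" in block["c"][0][1]:
            body = colourise(block["c"][1], raw_fmt)
            if raw_fmt == "html":
                body = '<p class="english">%s</p>' % body
            doc["blocks"][i] = {"t": "RawBlock", "c": [raw_fmt, body]}
    return doc


def main():
    """Filter entry point: JSON AST on stdin, target format in argv[1]."""
    target = sys.argv[1] if len(sys.argv) > 1 else "html"
    json.dump(transform(json.load(sys.stdin), target), sys.stdout)
```

To try it as a filter, one would append the usual 
`if __name__ == "__main__": main()` guard, save the file under some name 
(say english.py, a made-up name) and pass it to pandoc with --filter, once 
for the HTML5 output and once for the XeLaTeX/PDF output, so the same 
Markdown source serves both. If the hand-coded list ever becomes too 
limiting, the LEXICON lookup is the place where an NLTK tagger could be 
swapped in.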

On Tuesday, 8 December 2020 at 09:23:58 UTC BP wrote:

> That would mean natural language parsing and natural language parsing is 
> hard, not to mention CPU and memory intensive, and needs a huge database (a 
> tagged dictionary). You would need a largish external program for that, 
> unless you are content with highlighting a few closed (i.e. having a 
> smallish finite set of members) word classes like pronouns 
> and prepositions. Telling nouns, adjectives and verbs apart in English is 
> very hard since often the same form can be two or three of these depending 
> on context, so you basically need to do a semantic parse of the whole 
> sentence with grammatical syntax, which is non-trivial. If you try to 
> translate from English into a language which clearly distinguishes two or 
> all three of verbs/nouns/adjectives you will see that it often fails. Also 
> just knowing the word class of a word is of limited usefulness. You will 
> generally want to include syntactic information like what is the subject, 
> direct object, indirect object, main verb, auxiliary verb etc. Not easy. 
> Been there, done that. Admittedly it was nearly two decades ago and the 
> techniques have probably at least partly changed/improved since then, but 
> that does not necessarily mean that they have become less resource 
> intensive. There is a reason why tasks like translation are commonly done 
> server-side rather than client-side. There is also a reason why computer 
> programs are written in unambiguous artificial languages rather than 
> ambiguous natural languages and all natural languages are ambiguous; human 
> brains handle ambiguity well while computers basically suck at it. Natural 
> language processing is 90% disambiguation.
>
> Sorry if this is disappointing but such is the complexity of the task.
>
> -- 
> Better --help|less than helpless
>
> On Tue, 8 Dec 2020 at 05:54, R (Chandra) Chandrasekhar <chya...-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> 
> wrote:
>
>> I am looking at highlighting parts of speech in English within a 
>> suitably demarcated code block in Markdown.
>>
>> For example:
>>
>> ```{english}
>> I use Pandoc.
>>
>> ```
>>
>> where:
>>
>> _I_ is given a unique colour for pronouns
>> _use_ is given a unique colour for verbs
>> _Pandoc_ is given a unique colour for nouns.
>>
>> I fear that something that does this generically might require natural 
>> language processing (NLP) capabilities.
>>
>> Since my application is small, I am happy to hand-code individual 
>> colours for individual words.
>>
>> How would I accomplish this in a fashion that allows me to use the same 
>> source for HTML5 and XeLaTeX/PDF outputs?
>>
>> Thanks.
>>
>> Chandra
>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "pandoc-discuss" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to pandoc-discus...-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/pandoc-discuss/9a7f850d-7335-074b-1527-8a578ed13741%40gmail.com
>> .
>>
>



