public inbox archive for pandoc-discuss@googlegroups.com
From: BPJ <melroch-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
To: pandoc-discuss <pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
Subject: Re: Syntax highlighting for English as a natural language
Date: Tue, 8 Dec 2020 10:23:40 +0100
Message-ID: <CADAJKhAK=EWS0gTgqL=WpXTpWEK-AAKKASfxwgU4MSLgXtU=ig@mail.gmail.com>
In-Reply-To: <9a7f850d-7335-074b-1527-8a578ed13741-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>


That would mean natural language parsing, and natural language parsing is
hard, not to mention CPU- and memory-intensive, and it needs a huge database
(a tagged dictionary). You would need a largish external program for that,
unless you are content with highlighting a few closed word classes (i.e.
classes with a smallish finite set of members) like pronouns and
prepositions. Telling nouns, adjectives and verbs apart in English is very
hard since often the same form can be two or three of these depending on
context, so you basically need to do a semantic parse of the whole sentence
with grammatical syntax, which is non-trivial. If you try to translate from
English into a language which clearly distinguishes two or all three of
verbs/nouns/adjectives you will see that it often fails. Also just knowing
the word class of a word is of limited usefulness. You will generally want
to include syntactic information like what is the subject, direct object,
indirect object, main verb, auxiliary verb etc. Not easy. Been there, done
that. Admittedly it was nearly two decades ago and the techniques have
probably at least partly changed/improved since then, but that does not
necessarily mean that they have become less resource intensive. There is a
reason why tasks like translation are commonly done server-side rather than
client-side. There is also a reason why computer programs are written in
unambiguous artificial languages rather than ambiguous natural languages,
and all natural languages are ambiguous: human brains handle ambiguity well
while computers basically suck at it. Natural language processing is 90%
disambiguation.
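
For the narrow case of closed word classes, a plain lookup table may be
enough. What follows is only a minimal Python sketch under stated
assumptions: the word lists, the colour names and the function names
`word_class` and `highlight` are made up for illustration and are not part
of Pandoc, and the LaTeX output assumes the xcolor package is loaded. It
makes no attempt to escape HTML or LaTeX special characters.

```python
import re

# Hypothetical, deliberately tiny closed-class word lists (not exhaustive).
CLOSED_CLASSES = {
    "pronoun": {"i", "you", "he", "she", "it", "we", "they"},
    "preposition": {"in", "on", "at", "of", "to", "with", "from"},
}

# Hypothetical colour per class; both names exist in CSS and in xcolor.
COLOURS = {"pronoun": "blue", "preposition": "teal"}


def word_class(token):
    """Return the closed class a token is listed under, or None."""
    lowered = token.lower()
    for name, members in CLOSED_CLASSES.items():
        if lowered in members:
            return name
    return None


def highlight(text, fmt):
    """Wrap listed words in colour markup; fmt is 'html' or 'latex'."""
    pieces = []
    # The capturing group keeps the words in the split result, so the
    # text between words (spaces, punctuation) passes through unchanged.
    for piece in re.split(r"([A-Za-z']+)", text):
        name = word_class(piece)
        if name is None:
            pieces.append(piece)
        elif fmt == "html":
            pieces.append('<span style="color: %s">%s</span>'
                          % (COLOURS[name], piece))
        else:
            pieces.append(r"\textcolor{%s}{%s}" % (COLOURS[name], piece))
    return "".join(pieces)


print(highlight("I use Pandoc.", "html"))
print(highlight("I use Pandoc.", "latex"))
```

Only `I` gets coloured here. `use` is left alone because a lookup cannot
tell whether it is a verb or a noun in a given sentence, and that is the
disambiguation problem in miniature.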

Sorry if this is disappointing but such is the complexity of the task.

-- 
Better --help|less than helpless

On Tue, 8 Dec 2020 at 05:54, R (Chandra) Chandrasekhar <chyavana-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
wrote:

> I am looking at highlighting parts of speech in English within a
> suitably demarcated code block in Markdown.
>
> For example:
>
> ```{english}
> I use Pandoc.
>
> ```
>
> where:
>
> _I_ is given a unique colour for pronouns
> _use_ is given a unique colour for verbs
> _Pandoc_ is given a unique colour for nouns.
>
> I fear that something that does this generically might require natural
> language processing (NLP) capabilities.
>
> Since my application is small, I am happy to hand-code individual
> colours for individual words.
>
> How would I accomplish this in a fashion that allows me to use the same
> source for HTML5 and XeLaTeX/PDF outputs?
>
> Thanks.
>
> Chandra

