Syntax highlighting for English as a natural language

public inbox archive for pandoc-discuss@googlegroups.com

* Syntax highlighting for English as a natural language
@ 2020-12-08  4:53 R (Chandra) Chandrasekhar
       [not found] ` <9a7f850d-7335-074b-1527-8a578ed13741-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 8+ messages in thread
From: R (Chandra) Chandrasekhar @ 2020-12-08  4:53 UTC (permalink / raw)
  To: pandoc-discuss

I am looking at highlighting parts of speech in English within a 
suitably demarcated code block in Markdown.

For example:

```{english}
I use Pandoc.

```

where:

_I_ is given a unique colour for pronouns
_use_ is given a unique colour for verbs
_Pandoc_ is given a unique colour for nouns.

I fear that something that does this generically might require natural 
language processing (NLP) capabilities.

Since my application is small, I am happy to hand-code individual 
colours for individual words.

How would I accomplish this in a fashion that allows me to use the same 
source for HTML5 and XeLaTeX/PDF outputs?

Thanks.

Chandra

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Syntax highlighting for English as a natural language
       [not found] ` <9a7f850d-7335-074b-1527-8a578ed13741-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2020-12-08  9:23   ` BPJ
       [not found]     ` <CADAJKhAK=EWS0gTgqL=WpXTpWEK-AAKKASfxwgU4MSLgXtU=ig-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 8+ messages in thread
From: BPJ @ 2020-12-08  9:23 UTC (permalink / raw)
  To: pandoc-discuss


That would mean natural language parsing, and natural language parsing is
hard, not to mention CPU and memory intensive, and it needs a huge database
(a tagged dictionary). You would need a largish external program for that,
unless you are content with highlighting a few closed word classes (i.e.
classes with a smallish, finite set of members) like pronouns and
prepositions.

Telling nouns, adjectives and verbs apart in English is very hard, since
the same form can often be two or three of these depending on context, so
you basically need to do a semantic parse of the whole sentence with
grammatical syntax, which is non-trivial. If you try to translate from
English into a language which clearly distinguishes two or all three of
verbs/nouns/adjectives, you will see that it often fails. Also, just knowing
the word class of a word is of limited usefulness. You will generally want
to include syntactic information, like what is the subject, direct object,
indirect object, main verb, auxiliary verb, etc. Not easy. Been there, done
that. Admittedly it was nearly two decades ago, and the techniques have
probably at least partly changed and improved since then, but that does not
necessarily mean that they have become less resource intensive. There is a
reason why tasks like translation are commonly done server-side rather than
client-side.

There is also a reason why computer programs are written in unambiguous
artificial languages rather than ambiguous natural languages, and all
natural languages are ambiguous; human brains handle ambiguity well, while
computers basically suck at it. Natural language processing is 90%
disambiguation.

Sorry if this is disappointing but such is the complexity of the task.
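
For what it is worth, the "closed word classes" case mentioned above needs
nothing more than a lookup table. The following is only a minimal sketch:
the word lists and the names `CLOSED_CLASSES`, `classify` and
`tag_closed_classes` are made up for illustration and are far from complete.

```python
import re

# Hypothetical, deliberately tiny word lists; extend them by hand as needed.
CLOSED_CLASSES = {
    "pronoun": {"i", "you", "he", "she", "it", "we", "they",
                "me", "him", "us", "them"},
    "preposition": {"in", "on", "at", "to", "from", "with", "by", "of", "for"},
}


def classify(word):
    """Return the closed word class of `word`, or None if it is not listed."""
    lowered = word.lower()
    for word_class, members in CLOSED_CLASSES.items():
        if lowered in members:
            return word_class
    return None


def tag_closed_classes(sentence):
    """Pair each alphabetic token with its closed class (None when unknown)."""
    tokens = re.findall(r"[A-Za-z']+", sentence)
    return [(token, classify(token)) for token in tokens]
```

Open classes (nouns, verbs, adjectives) simply come back as `None` here,
which is exactly the part that needs real disambiguation.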

-- 
Better --help|less than helpless

-- 
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/CADAJKhAK%3DEWS0gTgqL%3DWpXTpWEK-AAKKASfxwgU4MSLgXtU%3Dig%40mail.gmail.com.


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Syntax highlighting for English as a natural language
       [not found]     ` <CADAJKhAK=EWS0gTgqL=WpXTpWEK-AAKKASfxwgU4MSLgXtU=ig-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2020-12-08 10:45       ` ph...-jvooPmwWovGx7wk7FZxnQA@public.gmane.org
       [not found]         ` <c81b271c-5fa5-4e97-bf9b-a111efcc26d6n-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  2020-12-08 12:59       ` R (Chandra) Chandrasekhar
  1 sibling, 1 reply; 8+ messages in thread
From: ph...-jvooPmwWovGx7wk7FZxnQA@public.gmane.org @ 2020-12-08 10:45 UTC (permalink / raw)
  To: pandoc-discuss



You could maybe use a Python filter and import the NLTK tools into that: it 
has functions to recognise parts of speech. But as BP points out it will be 
fairly complex and is not guaranteed to be 100% correct.
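
A rough sketch of what the tagging half of such a filter might look like.
It assumes NLTK is installed and that its English tagger data has been
downloaded beforehand (the resource names differ between NLTK versions, so
check the NLTK documentation). The mapping table and the function names are
invented for this example; only `nltk.pos_tag` is an NLTK call.

```python
# Coarse mapping from Penn Treebank tag prefixes (the tag set used by NLTK's
# default English tagger) to the word classes discussed in this thread.
# The table is an illustrative assumption, not part of NLTK.
PREFIX_TO_CLASS = [
    ("PRP", "pronoun"),    # PRP, PRP$
    ("VB", "verb"),        # VB, VBD, VBG, VBN, VBP, VBZ
    ("NN", "noun"),        # NN, NNS, NNP, NNPS
    ("JJ", "adjective"),   # JJ, JJR, JJS
]


def coarse_class(tag):
    """Collapse a Penn Treebank tag into a coarse word class, or None."""
    for prefix, word_class in PREFIX_TO_CLASS:
        if tag.startswith(prefix):
            return word_class
    return None


def tag_sentence(tokens):
    """Tag a list of word tokens with NLTK and return (word, class) pairs."""
    import nltk  # imported here so the mapping above is usable without NLTK

    return [(word, coarse_class(tag)) for word, tag in nltk.pos_tag(tokens)]
```

Calling `tag_sentence(["I", "use", "Pandoc"])` would then give one pair per
word; as noted above, the tags are statistical guesses and can be wrong.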


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Syntax highlighting for English as a natural language
       [not found]     ` <CADAJKhAK=EWS0gTgqL=WpXTpWEK-AAKKASfxwgU4MSLgXtU=ig-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2020-12-08 10:45       ` ph...-jvooPmwWovGx7wk7FZxnQA@public.gmane.org
@ 2020-12-08 12:59       ` R (Chandra) Chandrasekhar
       [not found]         ` <c1222fc6-4f1c-bb5e-c10b-ed44a3056067-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  1 sibling, 1 reply; 8+ messages in thread
From: R (Chandra) Chandrasekhar @ 2020-12-08 12:59 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw, BPJ

Thank you, BPJ.

I am aware of the difficulties with a natural language.

I came across this site which accomplishes that in English, quite 
impressively:

https://english.edward.io/

However, I do not know how the site's author did it.

Coming back to my question, my set of sentences to code in colour, 
depending on its part of speech, is small and I can do it by hand. But I 
need to do it in a fashion that allows the same pandoc-markdown source 
file to produce PDF and HTML.

For example, the sentence "I use Pandoc" might show up as:

I (in blue colour for pronouns)
use (in green colour for verbs)
Pandoc (in red colour for nouns)

How would I do this in pandoc-markdown (perhaps in a verbatim-like 
environment) to get the same results for HTML and PDF?

Thanks.

Chandra


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Syntax highlighting for English as a natural language
       [not found]         ` <c81b271c-5fa5-4e97-bf9b-a111efcc26d6n-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2020-12-08 13:47           ` R (Chandra) Chandrasekhar
       [not found]             ` <ea4fc1d6-e5a7-27c1-43b5-4a421b23d285-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 8+ messages in thread
From: R (Chandra) Chandrasekhar @ 2020-12-08 13:47 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw,
	ph...-jvooPmwWovGx7wk7FZxnQA@public.gmane.org

Is NLTK a Python toolkit?

I was very impressed by what I saw and tried out at this website:

https://english.edward.io/

In any case, my requirements are for a very small application in which I 
can custom-handcode the colours but I wish to use the same Markdown 
source to get both HTML and PDF.

Chandra



^ permalink raw reply	[flat|nested] 8+ messages in thread

* AW: Syntax highlighting for English as a natural language
       [not found]             ` <ea4fc1d6-e5a7-27c1-43b5-4a421b23d285-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2020-12-08 13:52               ` denis.maier-FfwAq0itz3ofv37vnLkPlQ
  0 siblings, 0 replies; 8+ messages in thread
From: denis.maier-FfwAq0itz3ofv37vnLkPlQ @ 2020-12-08 13:52 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

> Is NLTK a Python toolkit?

Yes, from their [website](http://www.nltk.org/): "NLTK is a leading platform for building Python programs to work with human language data."



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Syntax highlighting for English as a natural language
       [not found]         ` <c1222fc6-4f1c-bb5e-c10b-ed44a3056067-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2020-12-08 14:38           ` Daniel Staal
       [not found]             ` <67d88ac1-b144-2591-dd43-7e1afbf5bb89-Jdbf3xiKgS8@public.gmane.org>
  0 siblings, 1 reply; 8+ messages in thread
From: Daniel Staal @ 2020-12-08 14:38 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

On 12/8/20 7:59 AM, R (Chandra) Chandrasekhar wrote:
> 
> Coming back to my question, my set of sentences to code in colour, 
> depending on its part of speech, is small and I can do it by hand. But I 
> need to do it in a fashion that allows the same pandoc-markdown source 
> file to produce PDF and HTML.
> 
> For example, the sentence "I use Pandoc" might show up as:
> 
> I (in blue colour for pronouns)
> use (in green colour for verbs)
> Pandoc (in red colour for nouns)
> 
> How would I do this in pandoc-markdown (perhaps in a verbatim-like 
> environment) to get the same results for HTML and PDF?

If you are willing to do all the work manually, and only need it in HTML 
and PDF: use spans and CSS, and use HTML instead of LaTeX as the 
intermediary to PDF. Either native HTML spans or bracketed spans will do.
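
As a sketch of the bracketed-span variant: the snippet below generates the
Markdown and a matching stylesheet from a hand-coded word list. The class
names (`pronoun`, `verb`, `noun`) and the function names are assumptions
for this example; the colours follow the ones given earlier in the thread.

```python
# Hand-coded word classes and colours; nothing here is detected automatically.
WORD_CLASSES = {"I": "pronoun", "use": "verb", "Pandoc": "noun"}
COLOURS = {"pronoun": "blue", "verb": "green", "noun": "red"}


def to_bracketed_spans(sentence):
    """Wrap each known word in a pandoc bracketed span, e.g. [use]{.verb}."""
    out = []
    for word in sentence.split():
        core = word.rstrip(".,;:!?")   # keep trailing punctuation outside
        trailing = word[len(core):]
        if core in WORD_CLASSES:
            out.append(f"[{core}]{{.{WORD_CLASSES[core]}}}{trailing}")
        else:
            out.append(word)
    return " ".join(out)


def to_css():
    """Emit one CSS rule per word class."""
    return "\n".join(
        f"span.{name} {{ color: {colour}; }}" for name, colour in COLOURS.items()
    )


print(to_bracketed_spans("I use Pandoc."))
print(to_css())
```

With the CSS saved to a file (say `pos.css`, a made-up name), something
like `pandoc input.md -s --css pos.css -o out.html` should give the HTML,
and adding an HTML-based PDF engine such as `--pdf-engine=weasyprint` with
`-o out.pdf` should give the PDF. Check the pandoc manual for the engines
available on your system.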

Daniel T. Staal

-- 
---------------------------------------------------------------
This email copyright the author.  Unless otherwise noted, you
are expressly allowed to retransmit, quote, or otherwise use
the contents for non-commercial purposes.  This copyright will
expire 5 years after the author's death, or in 30 years,
whichever is longer, unless such a period is in excess of
local copyright law.
---------------------------------------------------------------


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Syntax highlighting for English as a natural language
       [not found]             ` <67d88ac1-b144-2591-dd43-7e1afbf5bb89-Jdbf3xiKgS8@public.gmane.org>
@ 2020-12-09  8:58               ` R (Chandra) Chandrasekhar
  0 siblings, 0 replies; 8+ messages in thread
From: R (Chandra) Chandrasekhar @ 2020-12-09  8:58 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw, Daniel Staal



On 08/12/2020 20:08, Daniel Staal wrote:
> If you are willing to do all the work manually, and only need it in HTML 
> and PDF: use spans and CSS, and use HTML instead of LaTeX as the 
> intermediary to PDF. Either native HTML spans or bracketed spans will do.
> 
> Daniel T. Staal
> 

That is what I ended up doing and it is quite satisfactory.

Thanks also for mentioning HTML to PDF: That is an option that did not 
occur to me as I was fixated on LaTeX.
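
For anyone who does want to stay on the LaTeX route with the same spans, a
small JSON filter could turn them into `\textcolor` commands. This is only
a sketch, not something tested against a real document: it assumes the
spans carry the classes `pronoun`, `verb` and `noun` (names invented here),
that the `xcolor` package is loaded in the LaTeX preamble, and that the
filter is applied only when the output goes through LaTeX.

```python
import json
import sys

# Assumed class-to-colour mapping; the class names are made up for this sketch.
COLOURS = {"pronoun": "blue", "verb": "green", "noun": "red"}


def latex(text):
    """Build a pandoc RawInline element holding raw LaTeX."""
    return {"t": "RawInline", "c": ["latex", text]}


def span_colour(node):
    """Return the colour for a Span with a known class, otherwise None."""
    if isinstance(node, dict) and node.get("t") == "Span":
        classes = node["c"][0][1]   # Attr is [identifier, classes, key-values]
        for name in classes:
            if name in COLOURS:
                return COLOURS[name]
    return None


def walk(node):
    """Copy the AST, splicing \\textcolor{...}{...} around coloured spans."""
    if isinstance(node, list):
        out = []
        for child in node:
            colour = span_colour(child)
            if colour is None:
                out.append(walk(child))
            else:
                out.append(latex("\\textcolor{%s}{" % colour))
                out.extend(walk(child["c"][1]))   # the span's inlines
                out.append(latex("}"))
        return out
    if isinstance(node, dict):
        return {key: walk(value) for key, value in node.items()}
    return node


def main():
    """Read pandoc's JSON AST from stdin, write the converted AST to stdout."""
    json.dump(walk(json.load(sys.stdin)), sys.stdout)
```

To use it as a filter, end the script with a call to `main()` and pass it
to pandoc with `--filter` when producing PDF via XeLaTeX; leave it out for
the HTML build so the spans stay available for CSS.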

Chandra



^ permalink raw reply	[flat|nested] 8+ messages in thread
