public inbox archive for pandoc-discuss@googlegroups.com
* Mardownify HTML entities
@ 2018-12-22  7:41 JM Marcastel
       [not found] ` <765e1c40-e292-4731-9618-3cd9a1438df7-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  0 siblings, 1 reply; 7+ messages in thread
From: JM Marcastel @ 2018-12-22  7:41 UTC (permalink / raw)
  To: pandoc-discuss


[-- Attachment #1.1: Type: text/plain, Size: 3241 bytes --]

HTML entities are nicely handled by Pandoc.

All characters listed at https://dev.w3.org/html5/html-author/charref are
supported from a Pandoc perspective (cf. cross-browser support
<https://webmasters.stackexchange.com/questions/65820/html-character-entities-special-characters-with-crossbrowser-crossplatform-suppo> for
browser-related Unicode support).

Nonetheless, would it be possible to transform HTML entities into Markdown
entities?

That is, rather than immediately delegating the handling of entities to an
HTML parser, let the Markdown parser handle entities and, if desired, place
them in the AST as RawBlock. This makes it possible to handle entities not
recognised by the HTML parser. As a by-product, it lets writers have their
own implementation of Pandoc's --ascii option.

The motivation behind this is to use the SGML entity syntax to complement
Pandoc's DIV and SPAN Markdown syntax with font handling capabilities. Though
DIV and SPAN markup in Markdown is very much HTML-tainted, we can use the
AST output in non-HTML outputs (I also use them in LaTeX outputs). Likewise,
the conventional HTML entity syntax could be used to support not only HTML
entities, but more generally any font.

Example. Imagine a document's source in Markdown which is output as both an
HTML page and a LaTeX/PDF document. Imagine the use of FontAwesome icons,
e.g. ambulance. One could of course use the very HTML-ish syntax in Markdown:
[]{.fa .fa-ambulance}. My suggestion here is to support this through HTML
entity markup, like so: &fa:ambulance;.

The entity in our example is not an HTML entity, so tagsoup cannot handle
it. It will be output in the AST as a string (concatenated to whatever
non-whitespace characters prefixed or suffixed it). Instead of this
behaviour, my suggestion is that it be isolated as an independent entry in
the AST, either as an html-tagged RawBlock or, even better, as a new token
type (RawBlock->entity).
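
To make the isolation step concrete, here is a minimal sketch in Python of
splitting a string into plain text and entity tokens. It is not how Pandoc's
Markdown reader works; the regular expression for entity names and the
function name split_entities are assumptions made for illustration only.

```python
import re

# Assumed entity shape: "&", a name that may contain ":", "." or "-", then ";".
ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9:.-]*);")

def split_entities(text):
    """Split text into ("str", s) and ("entity", name) tokens."""
    tokens = []
    pos = 0
    for match in ENTITY.finditer(text):
        if match.start() > pos:
            # Plain text that precedes the entity stays a string token.
            tokens.append(("str", text[pos:match.start()]))
        tokens.append(("entity", match.group(1)))
        pos = match.end()
    if pos < len(text):
        tokens.append(("str", text[pos:]))
    return tokens
```

With this sketch, an entity glued to punctuation, as in "(&fa:ambulance;)",
comes out as three tokens instead of one string. Known HTML entities would
match the same pattern, so a real reader would have to resolve those first.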

From a previous thread
<https://groups.google.com/forum/#!searchin/pandoc-discuss/entities|sort:date/pandoc-discuss/uLVoRNcrRRI/vHHCVehKCAAJ> I
understand that this is not trivial. Nonetheless, I believe this very
nicely complements the aforementioned DIV and SPAN capabilities, making it
easy to support iconographic fonts in Markdown documents without inventing
a new syntax, while keeping the existing markup relatively readable by
human eyes!

Note: I mention FontAwesome here only as an example, and by no means as an
indication that FontAwesome should be supported. Which fonts should be
supported is a writer consideration. From the AST perspective, the only
concern is isolating the Markdown entity.

-- 
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To post to this group, send email to pandoc-discuss-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/765e1c40-e292-4731-9618-3cd9a1438df7%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


* Re: Mardownify HTML entities
       [not found] ` <765e1c40-e292-4731-9618-3cd9a1438df7-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2018-12-22 16:16   ` mb21
       [not found]     ` <a87c8f1c-fa06-4817-ac7c-4bd170f70219-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  2019-01-01 20:35   ` John MacFarlane
  1 sibling, 1 reply; 7+ messages in thread
From: mb21 @ 2018-12-22 16:16 UTC (permalink / raw)
  To: pandoc-discuss


[-- Attachment #1.1: Type: text/plain, Size: 3792 bytes --]


>
> &fa:ambulance;. [...] will be output in the AST as a string (concatenated 
> to whatever non-whitespace characters prefixed or suffixed it). Rather than 
> this behaviour, my suggestion here, is that it be isolated as an 
> independent entry in the AST


You can already match on Str "&fa:ambulance;" in a Lua filter etc.
What is the advantage of the proposed approach?
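
As a rough illustration of that kind of matching, here is a sketch in Python
working on pandoc's JSON representation of the AST rather than in Lua. The
ICONS table and the function name replace_str are hypothetical, and the
\faAmbulance command is assumed to come from a LaTeX FontAwesome package.

```python
# Hypothetical mapping from entity text to raw LaTeX (an assumption).
ICONS = {"&fa:ambulance;": "\\faAmbulance"}

def replace_str(node):
    """Return a copy of a pandoc JSON tree with known Str nodes replaced."""
    if isinstance(node, list):
        return [replace_str(item) for item in node]
    if isinstance(node, dict):
        # Only an exact match on the whole Str content is recognised.
        if node.get("t") == "Str" and node.get("c") in ICONS:
            return {"t": "RawInline", "c": ["latex", ICONS[node["c"]]]}
        return {key: replace_str(value) for key, value in node.items()}
    return node
```

Because the match is on the whole string, a Str such as "(&fa:ambulance;)"
is left untouched; handling it would require splitting the string inside the
filter.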



* Re: Mardownify HTML entities
       [not found]     ` <a87c8f1c-fa06-4817-ac7c-4bd170f70219-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2018-12-22 19:36       ` BP Jonsson
  2018-12-25  4:26       ` JM Marcastel
  1 sibling, 0 replies; 7+ messages in thread
From: BP Jonsson @ 2018-12-22 19:36 UTC (permalink / raw)
  To: pandoc-discuss

On 2018-12-22 at 17:16, mb21 wrote:
> 
>>
>> &fa:ambulance;. [...] will be output in the AST as a string (concatenated
>> to whatever non-whitespace characters prefixed or suffixed it). Rather than
>> this behaviour, my suggestion here, is that it be isolated as an
>> independent entry in the AST
> 
> 
> You can match on Str "&fa:ambulance;" in a lua-filter etc. already now.
> What is the advantage with the proposed approach?

Let's not forget codespans!  Writing `` `ambulance`{.fa} `` in the
Markdown and having it intercepted by a trivial Lua filter seems
like a very viable alternative to me, at least as viable as
using custom entity names with as many or more characters.  I have
done the equivalent with [X-SAMPA][] instead of Unicode phonetic
transcription when I had to use a phone text editor with limited
Unicode input capabilities.  In both cases the real problem is
that *somewhere* you have to have a library, or at least a table,
mapping from entities or their equivalent to Unicode characters.
Since my filter was written in Perl I could use a rather capable
Perl library to translate from X-SAMPA to Unicode.  Without that
it doesn't matter a whole lot how I present the entities or their
equivalents to Pandoc, as long as it is reasonably easy to retrieve
the placeholder in the AST with a filter.  That said, it would
probably be useful if Pandoc stored unrecognised entities as HTML
RawInline elements rather than baking them into Str elements as it
does now, as that would make it much easier for a filter to
retrieve them.
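
The codespan approach can be sketched as follows, in Python and against
pandoc's JSON form of a Code inline. This is not the Perl filter mentioned
above: the class name "xsampa", the function name convert_code and the
three-entry table are assumptions, and only single-character X-SAMPA symbols
are handled.

```python
# Tiny excerpt of an X-SAMPA to IPA table; a real filter needs the full one.
XSAMPA = {"S": "\u0283", "N": "\u014b", "@": "\u0259"}

def convert_code(node):
    """Turn a Code inline with class 'xsampa' into a Str; leave others alone."""
    if node.get("t") != "Code":
        return node
    # A Code node carries [[identifier, classes, key-values], text].
    (_ident, classes, _attrs), text = node["c"]
    if "xsampa" not in classes:
        return node
    return {"t": "Str", "c": "".join(XSAMPA.get(ch, ch) for ch in text)}
```

For example, `` `fiS`{.xsampa} `` would become the string "fiʃ", while
codespans without the class pass through unchanged.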

[X-SAMPA]: https://en.wikipedia.org/wiki/X-SAMPA


* Re: Mardownify HTML entities
       [not found]     ` <a87c8f1c-fa06-4817-ac7c-4bd170f70219-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  2018-12-22 19:36       ` BP Jonsson
@ 2018-12-25  4:26       ` JM Marcastel
       [not found]         ` <72583D50-676B-4B1C-8F2A-6499D93E475A-hh8AyDY1G20S+FvcfC7Uqw@public.gmane.org>
  1 sibling, 1 reply; 7+ messages in thread
From: JM Marcastel @ 2018-12-25  4:26 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

[-- Attachment #1: Type: text/plain, Size: 2036 bytes --]

The idea is not so much about how to implement a feature as about adding markup syntax for an aliasing capability in Pandoc-flavoured Markdown.

SGML entities are just that: aliasing. DocBook authors know this well.

HTML entities, a targeted subset of SGML entities, are an ageing but ever so practical solution for Unicode support.

HTML entities are finite, and can consequently be known and handled by Pandoc, as is the case today.

Markdown entities would be infinite. They would adhere to the SGML reference syntax and be recognised as a new token type in the AST (or possibly as an HTML raw block).

Since they are recognised, you don’t need to intercept the AST and post-process it before building your output.

A Pandoc writer cannot see HTML entities, because they will have been converted to Unicode upstream.

Pandoc writers would see Markdown entities. They are cleanly marked in the AST and need no further parsing; they only need to be processed.

A writer that does not know how to process an entity should behave like an HTML browser with an unknown tag: ignore it.
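
A minimal sketch of that writer-side behaviour, in Python. The ALIASES
table, the format names used as keys and the function name render_entity are
all hypothetical; nothing here is an existing Pandoc API.

```python
# Hypothetical per-format alias tables (assumptions for illustration).
ALIASES = {
    "html": {"fa:ambulance": '<i class="fa fa-ambulance"></i>'},
    "latex": {"fa:ambulance": "\\faAmbulance"},
}

def render_entity(name, fmt):
    """Return the output for an entity, or "" when the writer does not know it."""
    return ALIASES.get(fmt, {}).get(name, "")
```

An unknown entity, or a known entity in a format with no table, simply
produces no output.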

HTML entities target Unicode. Markdown entities would simply be aliases for anything. I mentioned fonts, but that was a simplification.

As you can see, using an existing markup syntax with attributes, such as DIVs, SPANs, or code blocks, is not what I am aiming for, though it is easy enough to code.

Merry Christmas to all



* Re: Mardownify HTML entities
       [not found]         ` <72583D50-676B-4B1C-8F2A-6499D93E475A-hh8AyDY1G20S+FvcfC7Uqw@public.gmane.org>
@ 2018-12-27 17:54           ` mb21
       [not found]             ` <b7f67390-6eb9-424b-b5f3-5e0acf624574-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  0 siblings, 1 reply; 7+ messages in thread
From: mb21 @ 2018-12-27 17:54 UTC (permalink / raw)
  To: pandoc-discuss


[-- Attachment #1.1: Type: text/plain, Size: 776 bytes --]

> Markdown entities would simply be aliases for anything

Indeed, people usually just use Unicode characters, spans or divs, or
something like this Lua filter: https://stackoverflow.com/a/53372026/214446


* Re: Mardownify HTML entities
       [not found]             ` <b7f67390-6eb9-424b-b5f3-5e0acf624574-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2018-12-29  1:57               ` JM Marcastel
  0 siblings, 0 replies; 7+ messages in thread
From: JM Marcastel @ 2018-12-29  1:57 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

[-- Attachment #1: Type: text/plain, Size: 1424 bytes --]

> 
> People usually indeed use just unicode characters, spans or divs,

Entities are limited to Unicode in the HTML world, but certainly not in other SGML-derived syntaxes, such as DocBook.

Markdown was originally designed with HTML generation in mind; Pandoc-flavoured Markdown was not.

People also need abbreviations, text snippets for repeated text, and more.

For instance, abbreviations à la PHP Markdown Extra, which are currently recognised but skipped by Pandoc when the abbreviations extension is set.

Another example is a means of reusing metadata content without conflicting with the Liquid-like syntax in templates.

Rather than re-inventing the wheel with a new syntax, as shown in the referenced Lua script, the motivation is to re-use existing markup (even if it is currently delegated to a third-party library during the parsing phase).


* Re: Mardownify HTML entities
       [not found] ` <765e1c40-e292-4731-9618-3cd9a1438df7-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  2018-12-22 16:16   ` mb21
@ 2019-01-01 20:35   ` John MacFarlane
  1 sibling, 0 replies; 7+ messages in thread
From: John MacFarlane @ 2019-01-01 20:35 UTC (permalink / raw)
  To: JM Marcastel, pandoc-discuss


For related discussion, see
https://github.com/commonmark/CommonMark/issues/442
and the thread linked in the first comment there.

One issue came up when I tried to implement this:

Ideally, one would leave entities alone in link
titles, rather than converting them to characters, at
least if that's what one is doing generally. But we
really can't do that, since link titles are
represented in the AST as plain strings (not sequences
of inline nodes).
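
The difference can be seen in pandoc's JSON form of a Link node, sketched
here in Python. The helper name link is an assumption, and using a RawInline
to stand for an isolated entity is only one possible representation.

```python
def link(inlines, url, title):
    """Build a pandoc JSON Link node: [attr, link text inlines, [url, title]]."""
    return {"t": "Link", "c": [["", [], []], inlines, [url, title]]}

# Assumed representation of an isolated entity as an inline node.
entity = {"t": "RawInline", "c": ["html", "&fa:ambulance;"]}

node = link([entity], "https://example.com", "&fa:ambulance; nearby")
link_text = node["c"][1]   # a list of inline nodes, so the entity can be a node
title = node["c"][2][1]    # a plain string, with no room for a node
```

In the link text the entity survives as its own node, whereas in the title
it can only be characters inside a string.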

