public inbox archive for pandoc-discuss@googlegroups.com
* output html entities?
@ 2011-01-15 19:14 Mark (my words)
  2011-01-15 21:38 ` John MacFarlane
  0 siblings, 1 reply; 10+ messages in thread
From: Mark (my words) @ 2011-01-15 19:14 UTC
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

Heya,

I’ve noticed that pandoc, converting markdown to HTML, does not use HTML
entities. Em dashes, en dashes, ellipses, and quotation marks are output as
actual characters.

This happens regardless of the input format; for example, &mdash;, --, and —
are all converted to —.

Is there a way to have pandoc output HTML entities?


Mark 

-- 
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To post to this group, send email to pandoc-discuss-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To unsubscribe from this group, send email to pandoc-discuss+unsubscribe@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/pandoc-discuss?hl=en.



* Re: output html entities?
  2011-01-15 19:14 output html entities? Mark (my words)
@ 2011-01-15 21:38 ` John MacFarlane
       [not found]   ` <20110115213823.GA42115-8pRoOo7FpFbPoI0UbmHJ02ZHpeb/A1Y/@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: John MacFarlane @ 2011-01-15 21:38 UTC
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

+++ Mark (my words) [Jan 15 11 11:14 ]:
>    Heya,
> 
>    I’ve noticed that pandoc markdown to html does not use html entities. em
>    dashes, en dashes, ellipses, and quotation marks are output as actual
>    characters.
> 
>    This happens regardless of the input format, for example: &mdash;, --,
>    —, are all converted to —.
> 
>    Is there a way to have pandoc output HTML entities?

Pandoc converts all entities to unicode characters.
Your HTML output should be in UTF-8, and this should render just fine
provided you specify the charset (or use the `-s` option with the
default template).  Why do you need to output entities?
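
As a rough illustration of the charset point (an editorial sketch, not part of the original message), the Python snippet below encodes a sample string as UTF-8 and then decodes it under two different assumed charsets. The helper name `view_as` and the sample strings are made up for this example.

```python
def view_as(text: str, assumed_charset: str) -> str:
    """Encode text as UTF-8, then decode it the way a viewer that
    assumes `assumed_charset` would."""
    return text.encode("utf-8").decode(assumed_charset)


# Sample containing an ellipsis (U+2026) and an em dash (U+2014).
sample = "wait\u2026 yes\u2014no"
# The same text written with ASCII-only named entities.
entity_form = "wait&hellip; yes&mdash;no"

print(view_as(sample, "utf-8"))        # charset declared correctly
print(view_as(sample, "cp1252"))       # charset guessed wrongly: garbled
print(view_as(entity_form, "cp1252"))  # pure ASCII survives either way
```

With the right charset the text round-trips unchanged; with the wrong one each non-ASCII character turns into several unrelated characters, which is the kind of garbling a missing or incorrect charset declaration tends to produce.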

John



* Re: output html entities?
       [not found]   ` <20110115213823.GA42115-8pRoOo7FpFbPoI0UbmHJ02ZHpeb/A1Y/@public.gmane.org>
@ 2011-01-16  0:46     ` Mark (my words)
  2011-01-16  2:59       ` John MacFarlane
  2011-01-16 15:28       ` Bruce
  0 siblings, 2 replies; 10+ messages in thread
From: Mark (my words) @ 2011-01-16  0:46 UTC
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

My apologies to John for replying to him personally instead of the group as
I intended. Here’s my response so others can comment.

It causes havoc with HTML validators, and some websites and comment
systems don’t handle Unicode. In some fonts there is no discernible
difference between a hyphen and an en or em dash. For me, it makes
editing the HTML tougher at a later date. It’s also a personal
preference: in HTML code I’d much rather see &hellip; than a character
that, without mousing over, I can’t tell is an actual ellipsis or
separate periods. And, too often for my liking, switching systems and
apps can cause the Unicode to be displayed as some crazy East European
diacritics.


* Re: output html entities?
  2011-01-16  0:46     ` Mark (my words)
@ 2011-01-16  2:59       ` John MacFarlane
  2011-01-16 15:28       ` Bruce
  1 sibling, 0 replies; 10+ messages in thread
From: John MacFarlane @ 2011-01-16  2:59 UTC
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

If people wanted this feature, I could add an --ascii option that would cause
the HTML writer to emit entities instead of non-ASCII characters.
(Later I could add something similar for LaTeX, though this would only
be partial, as there is no *general* way, akin to numerical character
references, of representing Unicode characters in LaTeX.)

To be clear, this would mean that ALL non-ASCII characters get represented as
entities, even things like accented vowels, regardless of whether you
entered them as entities or with Unicode characters.

So, for example, "ä &hellip;" would be rendered with --ascii as
"&auml; &hellip;", and by default as "ä …". Named character entity references
would be used when available, numeric ones otherwise.

Thoughts?

John



* Re: output html entities?
  2011-01-16  0:46     ` Mark (my words)
  2011-01-16  2:59       ` John MacFarlane
@ 2011-01-16 15:28       ` Bruce
       [not found]         ` <4c92e083-7e37-4c00-8e5b-01d96f24dcab-avJ8sObw66XHdqrNY7FC6GB/v6IoIuQBVpNB7YpNyf8@public.gmane.org>
  1 sibling, 1 reply; 10+ messages in thread
From: Bruce @ 2011-01-16 15:28 UTC
  To: pandoc-discuss



On Jan 15, 7:46 pm, "Mark (my words)" <elib...-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:

> My apologies to John for replying to him personally instead of the group as
> I intended. Here’s my response so others can comment.
>
> It causes havoc with html validators, some websites and comment
> systems don’t handle unicode, in some fonts there is no discernible
> difference between a hyphen and an en or em dash,

But effectively, in 2011, aren't all of the issues you list above bugs
in that software?

If John is open to adding an option for this, that's up to him, but
I'd be very strongly averse to making such behavior the default.

Bruce

> it makes, for me,
> editing the html tougher at a later date, and it’s a personal
> preference in html code I’d much rather see &hellip; than a character
> that without mousing over I can’t tell is an actual ellipsis or
> separate periods, and, too often for my liking, switching systems and
> apps can cause the unicode to be displayed as some crazy east european
> diacritics.


* Re: output html entities?
       [not found]         ` <4c92e083-7e37-4c00-8e5b-01d96f24dcab-avJ8sObw66XHdqrNY7FC6GB/v6IoIuQBVpNB7YpNyf8@public.gmane.org>
@ 2011-01-16 15:42           ` Bruce
       [not found]             ` <1b63e5cd-5fe8-424e-a246-c5e39cd2e3b4-G0pd0bfxH2K3V1BluC5fqGB/v6IoIuQBVpNB7YpNyf8@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Bruce @ 2011-01-16 15:42 UTC
  To: pandoc-discuss

Also, can't you post-process? E.g., run the UTF-8 (X)HTML through some
command-line tool (tidy? xmllint?) to get what you want?
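
One way such a post-processing step could look, sketched in Python rather than with the tools named above (an editorial illustration; the function name `utf8_html_to_ascii` is made up). It relies on the standard `xmlcharrefreplace` error handler, which emits decimal numeric character references, so this gives numeric rather than named entities.

```python
def utf8_html_to_ascii(data: bytes) -> bytes:
    """Decode UTF-8 HTML and re-encode it as ASCII, writing a numeric
    character reference for every character outside ASCII."""
    return data.decode("utf-8").encode("ascii", errors="xmlcharrefreplace")


# Small in-memory demonstration with an em dash (U+2014).
html_in = "<p>yes\u2014no</p>".encode("utf-8")
print(utf8_html_to_ascii(html_in).decode("ascii"))
```

In a pipeline, the same function could be fed `sys.stdin.buffer.read()` and its result written to `sys.stdout.buffer`. Since ASCII is a subset of UTF-8, an existing UTF-8 charset declaration in the document stays valid.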



* Re: output html entities?
       [not found]             ` <1b63e5cd-5fe8-424e-a246-c5e39cd2e3b4-G0pd0bfxH2K3V1BluC5fqGB/v6IoIuQBVpNB7YpNyf8@public.gmane.org>
@ 2011-01-16 19:18               ` Mark (my words)
  2015-01-09 19:53                 ` Adam Wood
  0 siblings, 1 reply; 10+ messages in thread
From: Mark (my words) @ 2011-01-16 19:18 UTC
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

Well, I’m embarrassed.

I have been out of the loop for a *long* time. Back in the day I saw the
debate go from not using special characters, to using named entities, to
numerical entities, and back and forth. And now—

Are we actually at a point where we can use real raw characters?!
It strikes me as fantastic magic.

Bruce is right, we should be living in the present and planning for the
future. My bizarre need for human-readable machine language is antiquated,
needless, and moot—you can’t get much closer to human-readable than
straight-out Unicode.

Now I’m feeling rather silly for hacking up the MultiMarkdown source code to
spit out named entities, but it was a lot of fun.

And yeah, I’d installed the latest Tidy a couple of months back but hadn’t had
the time to play with it until now. Again, more magic: that project has come
a long way too!


Thanks guys for your patient advice.


-Mark


* Re: output html entities?
  2011-01-16 19:18               ` Mark (my words)
@ 2015-01-09 19:53                 ` Adam Wood
       [not found]                   ` <e5aaeeee-0d20-414c-a90f-7e67b347fb5a-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Adam Wood @ 2015-01-09 19:53 UTC
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw


Well this is 3 years later, but I happened to be looking into something and 
ran across it.

So... I can't tell if Mark was being facetious or not in his disavowal of
any desire to have HTML entities for curly quotes, em dashes, etc.

I still strongly prefer them, especially for certain cases where I don't have
control over the display environment. (I write for other people, and
sometimes other people have bad HTML character set declarations --- also
commenting systems, feeds, etc.)

My solution has been to run pandoc from within a bash script I wrote
that goes back afterwards and uses sed to replace characters with the
appropriate entities.
(I also use it to direct the output to an appropriate directory, with a
filename based on the original, and with all the other options I want ---
rather than trying to remember and type a million flags and options
and two file names on the command line.)

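kfile="$1.kramdown"
hfile="../html/$1.html"

# Convert with smart typography (-S), then swap curly quotes and dashes
# for named entities. Note that `sed -i ''` is the BSD/macOS form.
pandoc -f markdown-auto_identifiers -S -o "$hfile" "$kfile"
sed -i '' -e 's/’/\&rsquo;/g' -e 's/‘/\&lsquo;/g' \
    -e 's/“/\&ldquo;/g' -e 's/”/\&rdquo;/g' \
    -e 's/—/\&mdash;/g' -e 's/–/\&ndash;/g' "$hfile"
open -a "Sublime Text" "$hfile"
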

-- 
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To post to this group, send email to pandoc-discuss-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/e5aaeeee-0d20-414c-a90f-7e67b347fb5a%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


* Re: output html entities?
       [not found]                   ` <e5aaeeee-0d20-414c-a90f-7e67b347fb5a-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2015-01-09 20:14                     ` Matthew Pickering
  2015-01-09 20:33                     ` Daniel Staal
  1 sibling, 0 replies; 10+ messages in thread
From: Matthew Pickering @ 2015-01-09 20:14 UTC
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

I think the idiomatic way to do this substitution now would be with a
filter. I can whip up a quick example if you like?
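
For readers wondering what such a filter might look like, here is a hedged sketch in dependency-free Python (an editorial illustration, not the example offered above). It assumes the JSON AST shape used by recent pandoc versions, where a text node is `{"t": "Str", "c": "..."}` and raw HTML is `{"t": "RawInline", "c": ["html", "..."]}`; the entity table and all function names are made up for this example.

```python
import json
import sys

# Illustrative subset of characters and the entity to emit for each.
ENTITIES = {
    "\u2014": "&mdash;", "\u2013": "&ndash;", "\u2026": "&hellip;",
    "\u2018": "&lsquo;", "\u2019": "&rsquo;",
    "\u201c": "&ldquo;", "\u201d": "&rdquo;",
}


def split_str(text):
    """Split one string into Str nodes and raw-HTML entity nodes."""
    nodes, buf = [], ""
    for ch in text:
        if ch in ENTITIES:
            if buf:
                nodes.append({"t": "Str", "c": buf})
                buf = ""
            nodes.append({"t": "RawInline", "c": ["html", ENTITIES[ch]]})
        else:
            buf += ch
    if buf:
        nodes.append({"t": "Str", "c": buf})
    return nodes


def walk(node):
    """Recursively rewrite Str nodes found inside any list in the AST."""
    if isinstance(node, list):
        out = []
        for item in node:
            if isinstance(item, dict) and item.get("t") == "Str":
                out.extend(split_str(item["c"]))
            else:
                out.append(walk(item))
        return out
    if isinstance(node, dict):
        return {key: walk(value) for key, value in node.items()}
    return node


def main():
    """Filter entry point: read the JSON AST on stdin, write it to stdout."""
    json.dump(walk(json.load(sys.stdin)), sys.stdout)


# In-memory demonstration on a tiny hand-written AST fragment.
sample = {"blocks": [{"t": "Para", "c": [{"t": "Str", "c": "yes\u2014no"}]}]}
print(json.dumps(walk(sample)))
```

To try it as a real filter, one would add a call to `main()` at the end, save the file under a name of one's choosing, and pass it to pandoc's `--filter` option. Two caveats, as far as I know: quotation marks produced by smart typography are represented as `Quoted` nodes rather than as characters inside `Str`, so this sketch only catches quote characters that were typed literally; and raw HTML inlines are only emitted by HTML-family writers.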



* Re: output html entities?
       [not found]                   ` <e5aaeeee-0d20-414c-a90f-7e67b347fb5a-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  2015-01-09 20:14                     ` Matthew Pickering
@ 2015-01-09 20:33                     ` Daniel Staal
  1 sibling, 0 replies; 10+ messages in thread
From: Daniel Staal @ 2015-01-09 20:33 UTC
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

--As of January 9, 2015 11:53:46 AM -0800, Adam Wood is alleged to have 
said:

> Well this is 3 years later, but I happened to be looking into something
> and ran across it.
>
> So... I can't tell if Mark was being facetious or not in his disavowal of
> any desire to have html entities for curly quotes, em dashes, etc.
>
> I still strongly prefer it, especially for certain cases where I don't
> have control over the display environment. (I write for other people, and
> sometimes other people have bad html character set declarations --- also
> commenting systems, feeds, etc.)
>
> My solution has been to execute pandoc from within a bash script I wrote
> that goes back afterwards and uses sed to replace characters with their
> appropriate entity.
> (I also use it to direct the output to an appropriate directory, with a
> filename based on the original, and with all the other options I want ---
> rather than trying to remember and have to type a million flags and
> options and two file names into the command line)

--As for the rest, it is mine.

Just out of curiosity, does the `--ascii` option do about what you want?

(Personally, though, I'd like a 'plain text' option as well sometimes: I
have places that can't take the entities either.  BBEdit can clean
it up easily enough, so it doesn't really bug me.)

Daniel T. Staal

---------------------------------------------------------------
This email copyright the author.  Unless otherwise noted, you
are expressly allowed to retransmit, quote, or otherwise use
the contents for non-commercial purposes.  This copyright will
expire 5 years after the author's death, or in 30 years,
whichever is longer, unless such a period is in excess of
local copyright law.
---------------------------------------------------------------



end of thread, other threads:[~2015-01-09 20:33 UTC | newest]

Thread overview: 10+ messages
2011-01-15 19:14 output html entities? Mark (my words)
2011-01-15 21:38 ` John MacFarlane
     [not found]   ` <20110115213823.GA42115-8pRoOo7FpFbPoI0UbmHJ02ZHpeb/A1Y/@public.gmane.org>
2011-01-16  0:46     ` Mark (my words)
2011-01-16  2:59       ` John MacFarlane
2011-01-16 15:28       ` Bruce
     [not found]         ` <4c92e083-7e37-4c00-8e5b-01d96f24dcab-avJ8sObw66XHdqrNY7FC6GB/v6IoIuQBVpNB7YpNyf8@public.gmane.org>
2011-01-16 15:42           ` Bruce
     [not found]             ` <1b63e5cd-5fe8-424e-a246-c5e39cd2e3b4-G0pd0bfxH2K3V1BluC5fqGB/v6IoIuQBVpNB7YpNyf8@public.gmane.org>
2011-01-16 19:18               ` Mark (my words)
2015-01-09 19:53                 ` Adam Wood
     [not found]                   ` <e5aaeeee-0d20-414c-a90f-7e67b347fb5a-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
2015-01-09 20:14                     ` Matthew Pickering
2015-01-09 20:33                     ` Daniel Staal
