* output html entities?
From: Mark (my words) @ 2011-01-15 19:14 UTC
To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

Heya,

I’ve noticed that pandoc markdown to html does not use html entities. Em dashes, en dashes, ellipses, and quotation marks are output as actual characters.

This happens regardless of the input format, for example: —, --, —, are all converted to —.

Is there a way to have pandoc output HTML entities?

Mark

--
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To post to this group, send email to pandoc-discuss-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To unsubscribe from this group, send email to pandoc-discuss+unsubscribe@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/pandoc-discuss?hl=en.
* Re: output html entities?
From: John MacFarlane @ 2011-01-15 21:38 UTC
To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

+++ Mark (my words) [Jan 15 11 11:14 ]:
> Heya,
>
> I’ve noticed that pandoc markdown to html does not use html entities. em
> dashes, en dashes, ellipses, and quotation marks are output as actual
> characters.
>
> This happens regardless of the input format, for example: —, --,
> —, are all converted to —.
>
> Is there a way to have pandoc output HTML entities?

Pandoc converts all entities to unicode characters. Your HTML output should be in UTF-8, and this should render just fine provided you specify the charset (or use the `-s` option with the default template).

Why do you need to output entities?

John
* Re: output html entities?
From: Mark (my words) @ 2011-01-16 0:46 UTC
To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

My apologies to John for replying to him personally instead of the group as I intended. Here’s my response so others can comment.

It causes havoc with html validators; some websites and comment systems don’t handle unicode; in some fonts there is no discernible difference between a hyphen and an en or em dash; and it makes, for me, editing the html tougher at a later date. It’s also a personal preference: in html code I’d much rather see &hellip; than a character that, without mousing over, I can’t tell is an actual ellipsis or separate periods. And, too often for my liking, switching systems and apps can cause the unicode to be displayed as some crazy east european diacritics.
* Re: output html entities?
From: John MacFarlane @ 2011-01-16 2:59 UTC
To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

If people wanted this feature, I could add an --ascii option that would cause the HTML writer to emit entities instead of non-ascii characters. (Later I could add something similar in LaTeX, though this would only be partial as there is no *general* way, akin to numerical character references, of representing unicode characters in LaTeX.)

To be clear, this would mean that ALL non-ascii characters get represented as entities, even things like accented vowels, regardless of whether you entered them as entities or with unicode characters. So, for example, "ä …" would be rendered with --ascii as "&auml; &hellip;", and by default as "ä …". Character entity references would be used when available, otherwise numerical.

Thoughts?

John
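The behaviour proposed above (every non-ASCII character becomes a character entity reference when one exists, otherwise a numerical reference) can be sketched in a few lines of Python. This is only an illustration of the rule as described, not pandoc's implementation: the function name `to_ascii_html` is made up here, and Python's standard `html.entities.codepoint2name` table (the HTML 4 entity names) stands in for whatever entity table pandoc would use. ASCII characters such as `&` and `<` are passed through unchanged, on the assumption that the text has already been escaped for HTML.

```python
from html.entities import codepoint2name


def to_ascii_html(text: str) -> str:
    """Replace every non-ASCII character with an HTML character reference.

    Hypothetical helper for illustration only; it is not part of pandoc.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            # ASCII is left untouched (assumed to be escaped already).
            out.append(ch)
        elif cp in codepoint2name:
            # A named entity exists, e.g. U+2026 -> &hellip;
            out.append(f"&{codepoint2name[cp]};")
        else:
            # No named entity in the table: fall back to a numerical reference.
            out.append(f"&#{cp};")
    return "".join(out)


print(to_ascii_html("ä … — ‘quoted’ ★"))
# prints: &auml; &hellip; &mdash; &lsquo;quoted&rsquo; &#9733;
```

The star in the example has no HTML 4 entity name, so it shows the numerical fallback; the accented vowel shows that the rule is not limited to typographic punctuation.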
* Re: output html entities?
From: Bruce @ 2011-01-16 15:28 UTC
To: pandoc-discuss

On Jan 15, 7:46 pm, "Mark (my words)" <elib...-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> My apologies to John for replying to him personally instead of the group as
> I intended. Here’s my response so others can comment.
>
> It causes havoc with html validators, some websites and comment
> systems don’t handle unicode, in some fonts there is no discernible
> difference between a hyphen and an en or em dash,

But effectively, in 2011, aren't all of the issues you list above bugs in that software?

If John is open to adding an option for this, that's up to him, but I'd be very strongly averse to making such behavior the default.

Bruce
* Re: output html entities?
From: Bruce @ 2011-01-16 15:42 UTC
To: pandoc-discuss

Also, can't you post-process? E.g. run the UTF-8 (X)HTML through some command-line tool (tidy? xmllint?) to get what you want?
* Re: output html entities?
From: Mark (my words) @ 2011-01-16 19:18 UTC
To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

Well, I’m embarrassed.

I have been out of the loop for a *long* time. Back in the day I saw the debate go from not using special characters, to using named entities, to numerical entities, and back and forth. And now—

Are we actually at a point where we can use real raw characters?! It strikes me as a fantastic magic.

Bruce is right, we should be living in the present and planning for the future. My bizarre need for human readable machine language is antiquated and needless and moot—you can’t get much closer to human-readable than straight-out unicode.

Now I’m feeling rather silly for hacking up the Multimarkdown source code to spit out named entities, but it was a lot of fun.

And yeah, I’d installed the latest Tidy a couple months back but haven’t had the time to screw with it until now. Again, more magic: that project has come a long way too!

Thanks guys for your patient advice.

-Mark
* Re: output html entities?
From: Adam Wood @ 2015-01-09 19:53 UTC
To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

Well this is 3 years later, but I happened to be looking into something and ran across it.

So... I can't tell if Mark was being facetious or not in his disavowal of any desire to have html entities for curly quotes, emdashes, etc.

I still strongly prefer it, especially for certain cases where I don't have control over the display environment. (I write for other people, and sometimes other people have bad html character set declarations --- also commenting systems, feeds, etc.)

My solution has been to execute pandoc from within a bash script I wrote that goes back afterwards and uses sed to replace characters with their appropriate entity. (I also use it to direct the output to an appropriate directory, with a filename based on the original, and with all the other options I want --- rather than trying to remember and have to type a million flags and options and two file names into the command line.)

    kfile="$1.kramdown"
    hfile="../html/$1.html"

    # Convert with smart typography, then rewrite the typographic characters in place.
    pandoc -f markdown-auto_identifiers -S -o "$hfile" "$kfile"
    # The backslash keeps sed from treating & as "the matched text".
    # -i '' is the BSD/macOS form of in-place editing.
    sed -i '' -e "s/’/\&rsquo;/g" -e "s/‘/\&lsquo;/g" -e 's/“/\&ldquo;/g' -e 's/”/\&rdquo;/g' -e 's/—/\&mdash;/g' -e 's/–/\&ndash;/g' "$hfile"
    open -a "Sublime Text" "$hfile"
* Re: output html entities?
From: Matthew Pickering @ 2015-01-09 20:14 UTC
To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

I think the idiomatic way to do this substitution now would be with a filter. I can whip up a quick example if you like?
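No filter was actually posted in the thread, so the following is only a rough sketch of what such a filter might look like, written in Python with the standard library rather than Haskell. It assumes pandoc's JSON AST represents text as `{"t": "Str", "c": ...}` nodes and raw output as `{"t": "RawInline", "c": [format, text]}`; the names `ENTITIES`, `split_str` and `walk` are made up for illustration, and the character list simply mirrors the sed script earlier in the thread. Quotation marks that pandoc stores as `Quoted` nodes rather than literal characters would not be touched by this sketch.

```python
import json
import sys

# Characters to replace; the same six the sed script in the thread handles.
ENTITIES = {
    "\u2018": "&lsquo;",
    "\u2019": "&rsquo;",
    "\u201c": "&ldquo;",
    "\u201d": "&rdquo;",
    "\u2013": "&ndash;",
    "\u2014": "&mdash;",
}


def split_str(text):
    """Split text into Str and raw-HTML nodes at each mapped character."""
    nodes, buf = [], ""
    for ch in text:
        if ch in ENTITIES:
            if buf:
                nodes.append({"t": "Str", "c": buf})
                buf = ""
            nodes.append({"t": "RawInline", "c": ["html", ENTITIES[ch]]})
        else:
            buf += ch
    if buf:
        nodes.append({"t": "Str", "c": buf})
    return nodes


def walk(node):
    """Return a copy of the AST with every Str node in a list expanded."""
    if isinstance(node, list):
        out = []
        for item in node:
            if isinstance(item, dict) and item.get("t") == "Str":
                out.extend(split_str(item["c"]))
            else:
                out.append(walk(item))
        return out
    if isinstance(node, dict):
        return {key: walk(value) for key, value in node.items()}
    return node


def main():
    """Filter entry point: read a JSON AST on stdin, write the result to stdout."""
    json.dump(walk(json.load(sys.stdin)), sys.stdout)


# Small in-memory demonstration; a real filter script would call main() instead.
demo = {"blocks": [{"t": "Para", "c": [{"t": "Str", "c": "a\u2014b"}]}]}
print(json.dumps(walk(demo)))
```

Saved as a script that calls `main()`, something like this could presumably be passed to pandoc's `--filter` option. Because it emits raw HTML, it would only make sense for HTML output.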
* Re: output html entities?
From: Daniel Staal @ 2015-01-09 20:33 UTC
To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

--As of January 9, 2015 11:53:46 AM -0800, Adam Wood is alleged to have said:

> Well this is 3 years later, but I happened to be looking into something
> and ran across it.
>
> So... I can't tell if Mark was being facetious or not in his disavowal of
> any desire to have html entities for curly quotes, emdashes, etc.
>
> I still strongly prefer it, especially for certain cases where I don't
> have control over the display environment. (I write for other people, and
> sometimes other people have bad html character set declarations --- also
> commenting systems, feeds, etc.)
>
> My solution has been to execute pandoc from within a bash script I wrote
> that goes back afterwards and uses sed to replace characters with their
> appropriate entity.
> (I also use it to direct the output to an appropriate directory, with a
> filename based on the original, and with all the other options I want ---
> rather than trying to remember and have to type a million flags and
> options and two file names into the command line)

--As for the rest, it is mine.

Just out of curiosity, does the `--ascii` option do about what you want?

(Though personally I'd like a 'plain text' option as well sometimes - I have places that can't take the entities either. Though BBEdit can clean it up easily enough, so it doesn't really bug me.)

Daniel T. Staal

---------------------------------------------------------------
This email copyright the author. Unless otherwise noted, you are expressly allowed to retransmit, quote, or otherwise use the contents for non-commercial purposes. This copyright will expire 5 years after the author's death, or in 30 years, whichever is longer, unless such a period is in excess of local copyright law.
---------------------------------------------------------------