* Flaws in the Pandoc Unicode (OK, UTF-8) handling
@ 2015-02-01 2:42 Gordon Steemson
From: Gordon Steemson @ 2015-02-01 2:42 UTC
To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

I came very close to getting Pandoc to actually do what I mean today. Unfortunately, when I ran my Pandoc wrapper script (it divides up my custom-formatted whole-story Markdown files into individual chapters, each with a prepended metadata block, then calls Pandoc on each individual chapter) on a different input file, it worked the first couple of times and then started complaining that a specific well-formed UTF-8 character wasn’t well-formed (specifically, the CJKV ideograph for girl/woman/female: 女). Pandoc is the only software I can find that makes this claim about my file, so I am inclined to believe the file is not at fault, especially since it worked fine yesterday. I have reinstalled both Haskell and Pandoc, without effect.

This is not the first time Pandoc has annoyed me over UTF-8 interpretation; I have found that any attempt to print UTF-8 text to standard output or standard error from within my custom writer is doomed to failure. The individual bytes within each UTF-8 encoded character are being interpreted by some layer within Pandoc as Latin-1 or some similar single-byte encoding, and then erroneously re-translated into a string of two or three UTF-8 characters for every single UTF-8 character I try to output.

Every software setting I have control of is set to UTF-8. Even setting the locale within Lua with “os.setlocale('en_CA.UTF-8')” doesn’t have any effect.

I’m completely stumped here. Help!

--
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To post to this group, send email to pandoc-discuss-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/97d9c232-cc87-41ca-859b-f7495db3148f%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
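
The symptom described in the message above (each UTF-8 character coming out as two or three characters) is what a decode-as-Latin-1, re-encode-as-UTF-8 round trip produces. The sketch below only illustrates that mechanism in Python; it is not Pandoc's code, the function names are made up for this example, and Latin-1 as the intermediate encoding is the poster's hypothesis rather than a confirmed diagnosis.

```python
# Illustration only: reproduces the double-encoding symptom described above.
# Nothing here is taken from Pandoc; Latin-1 is an assumed intermediate encoding.


def double_encode(text: str) -> bytes:
    """Encode to UTF-8, misread the bytes as Latin-1, then encode to UTF-8 again."""
    utf8_bytes = text.encode("utf-8")
    misread = utf8_bytes.decode("latin-1")  # every byte becomes one character
    return misread.encode("utf-8")


def undo_double_encode(data: bytes) -> str:
    """Reverse the round trip, assuming Latin-1 really was the middle step."""
    return data.decode("utf-8").encode("latin-1").decode("utf-8")


original = "女"
garbled = double_encode(original)
print(len(original.encode("utf-8")))            # 3 (bytes in the correct encoding)
print(len(garbled.decode("utf-8")))             # 3 (characters after the bad round trip)
print(undo_double_encode(garbled) == original)  # True
```

If garbled output can be repaired with `undo_double_encode`, that suggests (but does not prove) that a single-byte decoding step sits somewhere between the writer and the terminal.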
* Re: Flaws in the Pandoc Unicode (OK, UTF-8) handling
@ 2015-02-01 6:49 John MacFarlane
From: John MacFarlane @ 2015-02-01 6:49 UTC
To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

There was a fix for UTF-8 in custom Lua writers in 1.12.4, so if your version is earlier you should upgrade.

I have no problem with the character you mention in a custom writer:

    % pandoc -t data/sample.lua
    girl/woman/female: 女)
    ^D
    <p>girl/woman/female: 女)</p>

Can you reproduce the problem with the sample custom writer, data/sample.lua?
* Re: Flaws in the Pandoc Unicode (OK, UTF-8) handling
@ 2015-02-01 7:43 Gordon Steemson
From: Gordon Steemson @ 2015-02-01 7:43 UTC
To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw, John MacFarlane

Thanks John for your prompt reply.

On 31 January 2015 at 22:49, John MacFarlane <jgm-TVLZxgkOlNX2fBVCVOL8/A@public.gmane.org> wrote:

> There was a fix for UTF-8 in custom lua writers in 1.12.4, so if your
> version is earlier you should upgrade.

I’m using the latest stable version, 1.13.2.

> I have no problem with the character you mention in a custom writer:
>
> % pandoc -t data/sample.lua
> girl/woman/female: 女)
> ^D
> <p>girl/woman/female: 女)</p>
>
> Can you reproduce the problem with the sample custom writer,
> data/sample.lua?

It works fine in sample.lua. However, until about 10 AM today it also worked fine in my custom writer. I think something a little more subtle is going on here.

I should add that the problem is not being triggered from the main body of the work… it’s coming from a “>” block in my YAML metadata header, which I found to be a fine place to keep stuff like author’s notes.

Incidentally, I don’t know why, but for the Markdown to parse correctly, you need to insert _two_ blank lines between paragraph text and the start of a bullet list in YAML metadata. If you only leave one blank line between them, the first bullet-list item gets folded into the preceding text paragraph. Kind of strange but there you are.

Gordon

--
The world’s only gsteemso
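
Since the reply above narrows the failure to one block of the metadata header, it may help to check each generated chapter file independently of Pandoc and find the exact byte offset, if any, where strict UTF-8 decoding fails. The following is a minimal sketch under that assumption; `first_invalid_utf8` and `check_file` are hypothetical helper names, not part of Pandoc or of the wrapper script discussed in this thread.

```python
# Hypothetical helpers for checking a file's bytes without involving Pandoc.
from typing import Optional


def first_invalid_utf8(data: bytes) -> Optional[int]:
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")  # strict decoding raises on malformed input
    except UnicodeDecodeError as err:
        return err.start
    return None


def check_file(path: str) -> Optional[int]:
    """Read a file as raw bytes and report where strict decoding fails."""
    with open(path, "rb") as handle:
        return first_invalid_utf8(handle.read())


print(first_invalid_utf8("女".encode("utf-8")))  # None (well-formed)
print(first_invalid_utf8(b"abc\xe5\xa5"))        # 3 (sequence cut off after two bytes)
```

A non-None result for a chapter file would point at the splitting step rather than at Pandoc; a None result for every file would support the view that the input is not at fault.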