* Flaws in the Pandoc Unicode (OK, UTF-8) handling
@ 2015-02-01 2:42 Gordon Steemson
[not found] ` <97d9c232-cc87-41ca-859b-f7495db3148f-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
0 siblings, 1 reply; 3+ messages in thread
From: Gordon Steemson @ 2015-02-01 2:42 UTC (permalink / raw)
To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw
I came very close to getting Pandoc to actually do what I mean today.
Unfortunately, when I ran my Pandoc wrapper script (it divides up my
custom-formatted whole-story Markdown files into individual chapters, each
with a prepended metadata block, then calls Pandoc on each individual
chapter) on a different input file, it worked the first couple of times and
then started complaining that a specific well-formed UTF-8 character wasn’t
well-formed (specifically, the CJKV ideograph for girl/woman/female: 女). Pandoc
is the only software I can find that makes this claim about my file, so I
am inclined to believe the file is not at fault — especially since it
worked fine yesterday. I have reinstalled both Haskell and Pandoc, without
effect.
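One way to check the file independently of Pandoc is a strict UTF-8 decode that reports the first offending byte offset. This is only a sketch: the function name first_invalid_utf8 and the file name chapter.md are made up for illustration, and it says nothing about how Pandoc itself validates input.

```python
def first_invalid_utf8(data: bytes):
    """Return (offset, offending_bytes) for the first invalid UTF-8
    sequence in data, or None if the whole buffer decodes cleanly."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError as err:
        # err.start / err.end delimit the bytes the decoder rejected.
        return err.start, data[err.start:err.end]
    return None


# Hypothetical usage on one of the generated chapter files:
#   with open("chapter.md", "rb") as fh:
#       print(first_invalid_utf8(fh.read()))
```

Running this on each per-chapter file the wrapper script produces, rather than only on the whole-story source, would show whether the bytes handed to Pandoc are still intact.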
This is not the first time Pandoc has given me trouble over UTF-8
interpretation; I have found that any attempt to print UTF-8 text to
standard output or standard error from within my custom writer is doomed to
failure. The individual bytes within each UTF-8 encoded character are being
interpreted by some layer within Pandoc as Latin-1 or some similar
single-byte encoding, and then erroneously re-translated into a string of
two or three UTF-8 characters for every single UTF-8 character I try to
output.
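The effect described above (UTF-8 bytes read as Latin-1, then encoded as UTF-8 again) can be reproduced outside Pandoc. The sketch below only illustrates the symptom using the character from this report; it does not claim to show where in Pandoc or Lua the misinterpretation happens.

```python
original = "女"                           # U+5973, one character
utf8_bytes = original.encode("utf-8")      # three bytes: e5 a5 b3

# Misreading: each byte is treated as one Latin-1 character.
mojibake = utf8_bytes.decode("latin-1")

# Re-encoding the misread text as UTF-8 gives two bytes per character.
double_encoded = mojibake.encode("utf-8")

print(len(original), len(mojibake))            # 1 3
print(utf8_bytes.hex(), double_encoded.hex())  # e5a5b3 c3a5c2a5c2b3

# The damage is reversible as long as no bytes were dropped along the way.
repaired = double_encoded.decode("utf-8").encode("latin-1").decode("utf-8")
```

Seeing three characters on the terminal where one was expected is therefore consistent with exactly one extra Latin-1 decoding step somewhere in the output path.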
Every software setting I have control of is set to UTF-8. Even setting the
locale within Lua with “os.setlocale('en_CA.UTF-8')” doesn’t have any
effect.
I’m completely stumped here. Help!
--
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To post to this group, send email to pandoc-discuss-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/97d9c232-cc87-41ca-859b-f7495db3148f%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
* Re: Flaws in the Pandoc Unicode (OK, UTF-8) handling
[not found] ` <97d9c232-cc87-41ca-859b-f7495db3148f-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2015-02-01 6:49 ` John MacFarlane
[not found] ` <20150201064928.GC12964-bi+AKbBUZKbivNSvqvJHCtPlBySK3R6THiGdP5j34PU@public.gmane.org>
0 siblings, 1 reply; 3+ messages in thread
From: John MacFarlane @ 2015-02-01 6:49 UTC (permalink / raw)
To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw
There was a fix for UTF-8 in custom lua writers in 1.12.4, so if your
version is earlier you should upgrade.
I have no problem with the character you mention in a custom writer:
% pandoc -t data/sample.lua
girl/woman/female: 女)
^D
<p>girl/woman/female: 女)</p>
Can you reproduce the problem with the sample custom writer,
data/sample.lua?
* Re: Flaws in the Pandoc Unicode (OK, UTF-8) handling
[not found] ` <20150201064928.GC12964-bi+AKbBUZKbivNSvqvJHCtPlBySK3R6THiGdP5j34PU@public.gmane.org>
@ 2015-02-01 7:43 ` Gordon Steemson
0 siblings, 0 replies; 3+ messages in thread
From: Gordon Steemson @ 2015-02-01 7:43 UTC (permalink / raw)
To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw, John MacFarlane
Thanks John for your prompt reply.
On 31 January 2015 at 22:49, John MacFarlane <jgm-TVLZxgkOlNX2fBVCVOL8/A@public.gmane.org> wrote:
> There was a fix for UTF-8 in custom lua writers in 1.12.4, so if your
> version is earlier you should upgrade.
>
I’m using the latest stable version, 1.13.2.
> I have no problem with the character you mention in a custom writer:
>
> % pandoc -t data/sample.lua
> girl/woman/female: 女)
> ^D
> <p>girl/woman/female: 女)</p>
>
> Can you reproduce the problem with the sample custom writer,
> data/sample.lua?
>
It works fine in sample.lua. However, until about 10 AM today it also
worked fine in my custom writer. I think something a little more subtle is
going on here.
I should add that the problem is not being triggered from the main body of
the work… it’s coming from a “>” (folded) block in my YAML metadata header, which I
found to be a fine place to keep stuff like author’s notes. Incidentally, I
don’t know why, but for the markdown to parse correctly, you need to insert
_two_ blank lines between paragraph text and the start of a bullet list in
YAML metadata. If you only leave one blank line between them, the first
bullet-list item gets folded into the preceding text paragraph. Kind of
strange but there you are.
Gordon
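A possible explanation for the two-blank-line observation, offered as an assumption rather than a confirmed diagnosis: in a YAML “>” (folded) block, a single line break between text lines becomes a space and each run of blank lines loses one line break, so one blank line in the source leaves only a single newline in the resulting string. Pandoc's Markdown by default wants a blank line before a list. The function fold_block below is a hypothetical, simplified model of that folding; it ignores more-indented lines and trailing-newline chomping, which real YAML parsers treat specially.

```python
import re


def fold_block(block: str) -> str:
    """Simplified model of YAML '>' folding (not a YAML parser):
    a run of k line breaks becomes a space when k == 1,
    and k - 1 newlines when k > 1."""
    text = block.strip("\n")

    def fold_run(match):
        k = len(match.group())
        return " " if k == 1 else "\n" * (k - 1)

    return re.sub(r"\n+", fold_run, text)


one_blank = fold_block("Author's note.\n\n- first item")
two_blank = fold_block("Author's note.\n\n\n- first item")

print(repr(one_blank))   # "Author's note.\n- first item"
print(repr(two_blank))   # "Author's note.\n\n- first item"
```

Under this model only the two-blank-line version still contains an empty line before the bullet once the metadata value is handed to the Markdown parser.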
--
The world’s only gsteemso