Flaws in the Pandoc Unicode (OK, UTF-8) handling

public inbox archive for pandoc-discuss@googlegroups.com

* Flaws in the Pandoc Unicode (OK, UTF-8) handling
@ 2015-02-01  2:42 Gordon Steemson
From: Gordon Steemson @ 2015-02-01  2:42 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw


I came very close to getting Pandoc to actually do what I mean today. 
Unfortunately, when I ran my Pandoc wrapper script (it divides up my 
custom-formatted whole-story Markdown files into individual chapters, each 
with a prepended metadata block, then calls Pandoc on each individual 
chapter) on a different input file, it worked the first couple of times and 
then started complaining that a specific well-formed UTF-8 character wasn’t 
well-formed (specifically, the CJKV ideograph for girl/woman/female: 女). Pandoc 
is the only software I can find that makes this claim about my file, so I 
am inclined to believe the file is not at fault — especially since it 
worked fine yesterday. I have reinstalled both Haskell and Pandoc, without 
effect.
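
One way to double-check the claim that the file itself is well-formed, independently of Pandoc, is a strict UTF-8 decode. The following is only a minimal sketch: the helper name `first_invalid_utf8` and the sample bytes are made up for illustration and are not part of Pandoc or of the wrapper script.

```python
def first_invalid_utf8(data: bytes):
    """Return the byte offset of the first malformed UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")  # strict decoding rejects any ill-formed sequence
    except UnicodeDecodeError as err:
        return err.start      # offset where the bad sequence begins
    return None


# Hypothetical sample data: one clean line, and the same line with the
# newline and the final byte of the ideograph cut off.
good = "girl/woman/female: 女\n".encode("utf-8")
bad = good[:-2]

print(first_invalid_utf8(good))
print(first_invalid_utf8(bad))
```

Running the same check on the bytes of a real chapter file (read with `open(path, "rb").read()`) would show whether the input is at fault or whether the problem lies further down the pipeline.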

This is not the first time Pandoc has given me trouble over UTF-8 
interpretation; I have found that any attempt to print UTF-8 text to 
standard output or standard error from within my custom writer is doomed to 
failure. The individual bytes within each UTF-8 encoded character are being 
interpreted by some layer within Pandoc as Latin-1 or some similar 
single-byte encoding, and then erroneously re-translated into a string of 
two or three UTF-8 characters for every single UTF-8 character I try to 
output.
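
The symptom described above (UTF-8 bytes read as Latin-1 and then encoded as UTF-8 a second time) can be reproduced outside Pandoc. This sketch only illustrates the mechanism; it does not show where in Pandoc or Lua the misreading actually happens, and all variable names are illustrative.

```python
original = "女"                            # U+5973, the character from the report

utf8_bytes = original.encode("utf-8")      # three bytes: E5 A5 B3
misread = utf8_bytes.decode("latin-1")     # each byte taken as one Latin-1 character
double_encoded = misread.encode("utf-8")   # every misread character re-encoded

print(len(utf8_bytes), len(misread), len(double_encoded))

# The damage is reversible as long as no byte was dropped along the way.
repaired = double_encoded.decode("utf-8").encode("latin-1").decode("utf-8")
print(repaired == original)
```

One ideograph turns into three visible characters, which matches the "two or three characters for every single character" pattern.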

Every software setting I have control of is set to UTF-8. Even setting the 
locale within Lua with “os.setlocale('en_CA.UTF-8')” doesn’t have any 
effect.

I’m completely stumped here. Help!

-- 
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To post to this group, send email to pandoc-discuss-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/97d9c232-cc87-41ca-859b-f7495db3148f%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


* Re: Flaws in the Pandoc Unicode (OK, UTF-8) handling
@ 2015-02-01  6:49   ` John MacFarlane
From: John MacFarlane @ 2015-02-01  6:49 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

There was a fix for UTF-8 in custom Lua writers in 1.12.4, so if your
version is earlier you should upgrade.

I have no problem with the character you mention in a custom writer:

    % pandoc -t data/sample.lua
    girl/woman/female: 女)
    ^D
    <p>girl/woman/female: 女)</p>

Can you reproduce the problem with the sample custom writer,
data/sample.lua?



* Re: Flaws in the Pandoc Unicode (OK, UTF-8) handling
@ 2015-02-01  7:43       ` Gordon Steemson
From: Gordon Steemson @ 2015-02-01  7:43 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw, John MacFarlane


Thanks John for your prompt reply.

On 31 January 2015 at 22:49, John MacFarlane <jgm-TVLZxgkOlNX2fBVCVOL8/A@public.gmane.org> wrote:

> There was a fix for UTF-8 in custom Lua writers in 1.12.4, so if your
> version is earlier you should upgrade.
>

I’m using the latest stable version, 1.13.2.

> I have no problem with the character you mention in a custom writer:
>
>    % pandoc -t data/sample.lua
>    girl/woman/female: 女)
>    ^D
>    <p>girl/woman/female: 女)</p>
>
> Can you reproduce the problem with the sample custom writer,
> data/sample.lua?
>

It works fine in sample.lua. However, until about 10 AM today it also
worked fine in my custom writer. I think something a little more subtle is
going on here.

I should add that the problem is not being triggered from the main body of
the work… it’s coming from a > block in my YAML metadata header, which I
found to be a fine place to keep stuff like author’s notes. Incidentally, I
don’t know why, but for the markdown to parse correctly, you need to insert
_two_ blank lines between paragraph text and the start of a bullet list in
YAML metadata. If you only leave one blank line between them, the first
bullet-list item gets folded into the preceding text paragraph. Kind of
strange but there you are.
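
A possible explanation for the two-blank-line behaviour, assuming the `>` block is a YAML folded scalar: folding turns a single line break into a space and drops one line break from every run of blank lines, so one blank line in the source reaches the Markdown parser as a bare newline rather than a paragraph break. The `fold` function below is a deliberately simplified model written for illustration (uniform indentation, no more-indented lines); it is not Pandoc's or any YAML library's code.

```python
def fold(lines):
    """Simplified model of YAML folded-scalar (`>`) line folding.

    Assumes every line has the same indentation; real YAML has further
    rules for more-indented lines and for trailing line breaks.
    """
    out = []
    blank_run = 0
    for line in lines:
        if line.strip() == "":
            blank_run += 1          # remember how many blank lines we passed
            continue
        if out:
            # No blank line: join with a space. N blank lines: N newlines.
            out.append("\n" * blank_run if blank_run else " ")
        out.append(line.strip())
        blank_run = 0
    return "".join(out)


one_blank = fold(["Some notes.", "", "- first item"])
two_blank = fold(["Some notes.", "", "", "- first item"])

print(repr(one_blank))
print(repr(two_blank))
```

With one blank line the list marker ends up directly under the paragraph text, and Pandoc's Markdown by default wants a blank line before a list. If this is indeed the cause, a literal block (`|` instead of `>`) should keep line breaks as written.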

Gordon





-- 
The world’s only gsteemso
