wrong encoded character - what is your strategy?

public inbox archive for pandoc-discuss@googlegroups.com

* wrong encoded character - what is your strategy?
@ 2014-06-20 14:54 Paulo Ney de Souza
       [not found] ` <CAFVhNZPoVra7suq3o8LRWf7CxX9xqAA05BgVqxssM7911Ee-kA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 7+ messages in thread
From: Paulo Ney de Souza @ 2014-06-20 14:54 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw


What is your strategy when you need to find a wrongly encoded character in
a file, after one of these messages from Pandoc?

pandoc: Cannot decode byte '\xa0': Data.Text.Encoding.Fusion.streamUtf8:
Invalid UTF-8 stream

Mine is to do a half-baked conversion with "iconv" and compare it to the
original file:

iconv orig.txt -t utf-8 > tmp
diff orig.txt tmp

then the first line of the diff is the line of the offending
character...but this strategy is less than optimal, and if the file has no
LF then it doesn't help at all.
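
A byte offset sidesteps the no-LF limitation. What follows is a minimal sketch in Python rather than shell; `first_bad_byte` is a hypothetical helper name, not something Pandoc or iconv provides. It relies on the `start` attribute carried by Python's `UnicodeDecodeError`.

```python
def first_bad_byte(data: bytes):
    """Return (offset, line, column) of the first invalid UTF-8 byte, or None.

    The offset is 0-based; line and column are 1-based, and the column
    is counted in bytes, not characters.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        offset = err.start
        line = data.count(b"\n", 0, offset) + 1
        # rfind returns -1 when there is no earlier LF, so the column
        # is still correct for a file with no line breaks at all.
        column = offset - data.rfind(b"\n", 0, offset)
        return offset, line, column
    return None


if __name__ == "__main__":
    sample = b"first line\nnon-breaking\xa0space\n"
    print(first_bad_byte(sample))
```

For a real file, the bytes would come from something like `open("orig.txt", "rb").read()`.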

How do you deal with this?

Paulo Ney

-- 
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To post to this group, send email to pandoc-discuss-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/CAFVhNZPoVra7suq3o8LRWf7CxX9xqAA05BgVqxssM7911Ee-kA%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


* Re: wrong encoded character - what is your strategy?
       [not found] ` <CAFVhNZPoVra7suq3o8LRWf7CxX9xqAA05BgVqxssM7911Ee-kA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2014-06-20 15:50   ` Shahbaz Youssefi
       [not found]     ` <CALeOzZ9-wqnvr6_EYDk4FRTfnXTXVi=J7fzU+HSsvVETC0k8eQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 7+ messages in thread
From: Shahbaz Youssefi @ 2014-06-20 15:50 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw


That's a good point. It would be great if Pandoc would give an indication
of where the error happened. This happened just recently to me too, which
turned out to be a mistaken CTRL+god_knows_what in insert mode with vim
that had inserted that character.

Nevertheless, I quickly found it with a binary search.
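
The binary search mentioned above can be sketched as follows, in Python. This illustrates the general technique and is not a record of what was actually run: `shortest_failing_prefix` and `decodes_so_far` are hypothetical names, and the oracle here is Python's own UTF-8 decoder standing in for a Pandoc run on each prefix.

```python
import codecs


def decodes_so_far(chunk: bytes) -> bool:
    """Stand-in oracle: True if chunk is valid UTF-8, tolerating a
    multi-byte character that is cut off at the very end."""
    try:
        codecs.getincrementaldecoder("utf-8")().decode(chunk, final=False)
    except UnicodeDecodeError:
        return False
    return True


def shortest_failing_prefix(data: bytes, ok=decodes_so_far) -> int:
    """Binary search for the smallest n such that data[:n] fails the oracle.

    Assumes the whole input fails, and that failure is monotonic:
    once a prefix fails, every longer prefix fails too.
    """
    lo, hi = 0, len(data)  # invariant: data[:lo] passes, data[:hi] fails
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if ok(data[:mid]):
            lo = mid
        else:
            hi = mid
    return hi


if __name__ == "__main__":
    sample = b"first line\nnon-breaking\xa0space\n"
    n = shortest_failing_prefix(sample)
    # The byte at offset n - 1 is the one that triggers the failure.
    print(n - 1, sample[n - 1:n])
```

If the oracle were an external converter instead, splitting at arbitrary byte offsets could cut a valid multi-byte character in half and produce a false failure, so splitting at line boundaries would be the safer choice there.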


On Fri, Jun 20, 2014 at 4:54 PM, Paulo Ney de Souza <pauloney-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
wrote:

> What is your strategy when you need to find a wrongly encoded character in
> a file, after one of these messages from Pandoc?
>
> pandoc: Cannot decode byte '\xa0': Data.Text.Encoding.Fusion.streamUtf8:
> Invalid UTF-8 stream
>
> Mine is to do a half-baked conversion with "iconv" and compare it to the
> original file:
>
> iconv orig.txt -t utf-8 > tmp
> diff orig.txt tmp
>
> then the first line of the diff is the line of the offending
> character...but this strategy is less than optimal, and if the file has no
> LF then it doesn't help at all.
>
> How do you deal with this?
>
> Paulo Ney

* Re: wrong encoded character - what is your strategy?
       [not found]     ` <CALeOzZ9-wqnvr6_EYDk4FRTfnXTXVi=J7fzU+HSsvVETC0k8eQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2014-06-20 16:29       ` John MacFarlane
       [not found]         ` <20140620162931.GC6991-bi+AKbBUZKbivNSvqvJHCtPlBySK3R6THiGdP5j34PU@public.gmane.org>
  0 siblings, 1 reply; 7+ messages in thread
From: John MacFarlane @ 2014-06-20 16:29 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

Unfortunately the library I'm using for UTF8 conversion doesn't
give a position.

+++ Shahbaz Youssefi [Jun 20 14 17:50 ]:
>   That's a good point. It would be great if Pandoc would give an
>   indication of where the error happened. This happened just recently to
>   me too, which turned out to be a mistaken CTRL+god_knows_what in insert
>   mode with vim that had inserted that character.
>   Nevertheless, I quickly found it with a binary search.
>
>   On Fri, Jun 20, 2014 at 4:54 PM, Paulo Ney de Souza
>   <pauloney-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>
>   What is your strategy when you need to find a wrongly encoded character
>   in a file, after one of these messages from Pandoc?
>   pandoc: Cannot decode byte '\xa0':
>   Data.Text.Encoding.Fusion.streamUtf8: Invalid UTF-8 stream
>   Mine is to do a half-baked conversion with "iconv" and compare it to
>   the original file:
>   iconv orig.txt -t utf-8 > tmp
>   diff orig.txt tmp
>   then the first line of the diff is the line of the offending
>   character...but this strategy is less than optimal, and if the file has
>   no LF then it doesn't help at all.
>   How do you deal with this?
>   Paulo Ney



* Re: wrong encoded character - what is your strategy?
       [not found]         ` <20140620162931.GC6991-bi+AKbBUZKbivNSvqvJHCtPlBySK3R6THiGdP5j34PU@public.gmane.org>
@ 2014-06-20 17:05           ` Paulo Ney de Souza
       [not found]             ` <CAFVhNZPq4pFYWPsqz1G1PAz-qSbwHogYuwnNacq2Tz4C39oJhA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 7+ messages in thread
From: Paulo Ney de Souza @ 2014-06-20 17:05 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw


John,

Would it be possible to do more here - possibly by augmenting the
library? It would not only be ultra-nice for normal files, but it is also
about to become more important with the ePub reader, because we will not be
able to "iconv" an ePub file!

Paulo Ney


On Fri, Jun 20, 2014 at 11:29 AM, John MacFarlane <jgm-TVLZxgkOlNX2fBVCVOL8/A@public.gmane.org> wrote:

> Unfortunately the library I'm using for UTF8 conversion doesn't
> give a position.

* Re: wrong encoded character - what is your strategy?
       [not found]             ` <CAFVhNZPq4pFYWPsqz1G1PAz-qSbwHogYuwnNacq2Tz4C39oJhA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2014-06-20 18:28               ` BPJ
       [not found]                 ` <CADAJKhDy+Gji1oShbJnB=qONP+nJXDjjVGPBnZkivodyaJEb_g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 7+ messages in thread
From: BPJ @ 2014-06-20 18:28 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw


Perl's Encode module can be made to return the part before the offending
bytes and the rest as separate strings, which allows you to determine from
context where the offender is.

Do you get these errors often? And do you have a UTF-8 system locale?

/bpj
On 20 Jun 2014 at 19:05, "Paulo Ney de Souza" <pauloney-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:

> John,
>
> Would it be possible to do more here - possibly by augmenting the
> library? It would not only be ultra-nice for normal files, but it is also
> about to become more important with the ePub reader, because we will not be
> able to "iconv" an ePub file!
>
> Paulo Ney
>
>
> On Fri, Jun 20, 2014 at 11:29 AM, John MacFarlane <jgm-TVLZxgkOlNX2fBVCVOL8/A@public.gmane.org>
> wrote:
>
>> Unfortunately the library I'm using for UTF8 conversion doesn't
>> give a position.

* Re: wrong encoded character - what is your strategy?
       [not found]                 ` <CADAJKhDy+Gji1oShbJnB=qONP+nJXDjjVGPBnZkivodyaJEb_g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2014-06-20 18:38                   ` Paulo Ney de Souza
       [not found]                     ` <CAFVhNZMPEZ0XnOXq6-RXDQPXMYz7rf6u6AzdZMoW7mtrFKBMFw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 7+ messages in thread
From: Paulo Ney de Souza @ 2014-06-20 18:38 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw


On Fri, Jun 20, 2014 at 1:28 PM, BPJ <bpj-J3H7GcXPSITLoDKTGw+V6w@public.gmane.org> wrote:

> Perl's Encode module can be made to return the part before the offending
> bytes and the rest as separate strings, which allows you to determine from
> context where the offender is.
>

How do you do that? Can you share the recipe?


> Do you get these errors often?
>

Yes, incredibly often. We have to deal with it on a daily basis.


> And do you have a UTF-8 system locale?
>
> /bpj
>

These are not files generated on our systems. They are sent by authors all
over the world and the encodings are all over the place! Most of the time
what generates the wrong character is not the system, but some copy &
pasting by the author.

Paulo Ney


* Re: wrong encoded character - what is your strategy?
       [not found]                     ` <CAFVhNZMPEZ0XnOXq6-RXDQPXMYz7rf6u6AzdZMoW7mtrFKBMFw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2014-06-20 21:39                       ` BP Jonsson
  0 siblings, 0 replies; 7+ messages in thread
From: BP Jonsson @ 2014-06-20 21:39 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

2014-06-20 20:38, Paulo Ney de Souza wrote:
>
>
>
> On Fri, Jun 20, 2014 at 1:28 PM, BPJ <bpj-J3H7GcXPSITLoDKTGw+V6w@public.gmane.org> wrote:
>
>     Perl's Encode module can be made to return the part before the
>     offending bytes and the rest as separate strings, which allows
>     you to determine from context where the offender is.
>
>
> How do you do that? Can you share the recipe?

<https://metacpan.org/pod/Encode#Handling-Malformed-Data>

<https://metacpan.org/pod/Encode#FB_QUIET>

I usually just let it return the string with replacement characters,
the default behavior. Then I split the decoded text/string
on newlines and number the lines:

     my $lno = 1;
     my @lines = map { $lno++ . ': ' . $_ } split /\n/, $decoded;

Then I grep for lines containing the replacement character and
print out the found lines, getting a list of numbered lines with
the bad characters, something like:

     144: Repellendus sed est odit. Sit est m�llitia ea fugiat laborum ut.
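
The same numbered-line listing can be approximated in Python, for anyone without Perl at hand. This is a sketch under assumptions: `lines_with_bad_bytes` is a hypothetical name, and Python's `errors="replace"` decoding is used as a stand-in for Encode's default replacement behavior.

```python
def lines_with_bad_bytes(data: bytes):
    """Return numbered lines that contain U+FFFD after a lenient UTF-8 decode."""
    decoded = data.decode("utf-8", errors="replace")
    return [
        f"{lineno}: {line}"
        for lineno, line in enumerate(decoded.split("\n"), start=1)
        if "\ufffd" in line
    ]


if __name__ == "__main__":
    sample = b"good line\nSit est m\xf3llitia ea fugiat\nalso good\n"
    for entry in lines_with_bad_bytes(sample):
        print(entry)
```

One caveat: a file that legitimately contains U+FFFD would be flagged as well, since the check cannot tell an inserted replacement character from an original one.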

Luckily the only encodings I usually have to deal with are UTF-8
(the desired one!), Latin-1 and cp1252, and it's obvious from
context which the intruding encoding is. A good heuristic for
distinguishing between the last two is to look for (with a regex)
the bytes which are mapped to quote characters in cp1252. If they
are absent it's usually Latin-1, if they are present it's cp1252.
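
The quote-byte heuristic might look like this in Python. It is only a sketch: `guess_legacy_encoding` is a hypothetical name, and restricting the check to bytes 0x91-0x94 (the curly single and double quotes in cp1252) is an assumption about which quote bytes are meant.

```python
import re

# In cp1252 the bytes 0x91-0x94 are the curly single and double quotes;
# in Latin-1 the same bytes are rarely used C1 control characters.
CP1252_QUOTE_BYTES = re.compile(rb"[\x91-\x94]")


def guess_legacy_encoding(data: bytes) -> str:
    """Guess 'cp1252' or 'latin-1' for bytes already known not to be UTF-8."""
    return "cp1252" if CP1252_QUOTE_BYTES.search(data) else "latin-1"


if __name__ == "__main__":
    raw = b"\x93quoted\x94 text"
    encoding = guess_legacy_encoding(raw)
    print(encoding, raw.decode(encoding))
```

This is a guess rather than a detection: text in cp1252 that happens to contain no curly quotes will be reported as Latin-1, which decodes the remaining printable bytes identically in most cases.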

If you need more exact info on what is wrong, the Encode::FB_PERLQQ
mode is very informative, since you can then see what the offending
byte(s) were.
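
A rough Python counterpart to that mode, offered as an analogy rather than an equivalent: the `backslashreplace` error handler renders each undecodable byte as a `\xNN` escape in the output. `show_bad_bytes` is a hypothetical name.

```python
def show_bad_bytes(data: bytes) -> str:
    """Decode as UTF-8, rendering each undecodable byte as a \\xNN escape."""
    return data.decode("utf-8", errors="backslashreplace")


if __name__ == "__main__":
    print(show_bad_bytes(b"non-breaking\xa0space"))
```

Searching the result for a literal backslash followed by `x` then shows both where the bad bytes are and what they were.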

> These are not files generated on our systems. They are sent by
> authors all over the world and the encodings are all over the
> place! Most of the time what generates the wrong character is not
> the system, but some copy & pasting by the author.

Sounds familiar, although my clientele is not spread out quite as much! :-)

/bpj

