Gnus development mailing list
* Invalid characters, or something else?
@ 2010-10-17  0:05 Lars Magne Ingebrigtsen
  2010-10-17  0:20 ` Russ Allbery
  0 siblings, 1 reply; 12+ messages in thread
From: Lars Magne Ingebrigtsen @ 2010-10-17  0:05 UTC (permalink / raw)
  To: ding

Some Gwene groups have some characters that can't be parsed as utf-8 (or
something).  It's usually dash characters or the like.

Like this:

---
other easy activities—talking, chewing gum—and suggest 
---

Is that valid?  If not, what's the likely explanation for where it's
coming from?

-- 
(domestic pets only, the antidote for overdose, milk.)
  larsi@gnus.org * Lars Magne Ingebrigtsen





* Re: Invalid characters, or something else?
  2010-10-17  0:05 Invalid characters, or something else? Lars Magne Ingebrigtsen
@ 2010-10-17  0:20 ` Russ Allbery
  2010-10-17 13:07   ` Lars Magne Ingebrigtsen
  2010-10-17 21:30   ` Kevin Ryde
  0 siblings, 2 replies; 12+ messages in thread
From: Russ Allbery @ 2010-10-17  0:20 UTC (permalink / raw)
  To: ding

Lars Magne Ingebrigtsen <larsi@gnus.org> writes:

> Some Gwene groups have some characters that can't be parsed as utf-8 (or
> something).  It's usually dash characters or the like.

> Like this:

> ---
> other easy activities—talking, chewing gum—and suggest 
> ---

> Is that valid?  If not, what's the likely explanation for where it's
> coming from?

They're from Windows-1252 (or 1250).  Octal 227 is a dash in that charset.
Microsoft users tend to send unlabelled Windows code pages.

http://en.wikipedia.org/wiki/Windows-1252

-- 
Russ Allbery (rra@stanford.edu)             <http://www.eyrie.org/~eagle/>




* Re: Invalid characters, or something else?
  2010-10-17  0:20 ` Russ Allbery
@ 2010-10-17 13:07   ` Lars Magne Ingebrigtsen
  2010-10-17 13:17     ` Lars Magne Ingebrigtsen
  2010-10-17 21:30   ` Kevin Ryde
  1 sibling, 1 reply; 12+ messages in thread
From: Lars Magne Ingebrigtsen @ 2010-10-17 13:07 UTC (permalink / raw)
  To: ding

Russ Allbery <rra@stanford.edu> writes:

> They're from Windows-1252 (or 1250).  Octal 227 is a dash in that charset.
> Microsoft users tend to send unlabelled Windows code pages.
>
> http://en.wikipedia.org/wiki/Windows-1252

Hm...  should this be handled by Emacs, Gnus or Gwene?  :-)

It's probably easier to do on the Gwene side...  is there some Perl
library to translate from these characters to, er, something that works?

-- 
(domestic pets only, the antidote for overdose, milk.)
  larsi@gnus.org * Lars Magne Ingebrigtsen





* Re: Invalid characters, or something else?
  2010-10-17 13:07   ` Lars Magne Ingebrigtsen
@ 2010-10-17 13:17     ` Lars Magne Ingebrigtsen
  2010-10-17 13:22       ` Lars Magne Ingebrigtsen
  2010-10-17 21:33       ` James Cloos
  0 siblings, 2 replies; 12+ messages in thread
From: Lars Magne Ingebrigtsen @ 2010-10-17 13:17 UTC (permalink / raw)
  To: ding

Gnus does have this already:

(defvar gnus-article-dumbquotes-map
  '(("\200" "EUR")
    ("\202" ",")
    ("\203" "f")
    ("\204" ",,")
    ("\205" "...")
    ("\213" "<")
    ("\214" "OE")
    ("\221" "`")
    ("\222" "'")
    ("\223" "``")
    ("\224" "\"")
    ("\225" "*")
    ("\226" "-")
    ("\227" "--")
    ("\230" "~")
    ("\231" "(TM)")
    ("\233" ">")
    ("\234" "oe")
    ("\264" "'"))
  "Table for MS-to-Latin1 translation.")

But it doesn't quite work, because the text from Gwene has

\302\227

instead of just

\227

So this could probably be fixed pretty easily on the Gwene side, but it
should map the smartquotes to Unicode characters, not ASCII.  Anybody
have such a table?
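
For anyone wondering where the \302\227 comes from: it is what you get
when the Windows-1252 byte \227 is decoded as Latin-1 (giving the C1
control character U+0097) and then re-encoded as UTF-8.  The following is
only a sketch of one way to undo that, in Python rather than whatever
Gwene actually uses; the function name repair_c1_controls is made up for
this example.

```python
def repair_c1_controls(text: str) -> str:
    """Reinterpret C1 control characters (U+0080..U+009F) as Windows-1252.

    Hypothetical helper for illustration; not part of Gnus or Gwene.
    """
    out = []
    for ch in text:
        if 0x80 <= ord(ch) <= 0x9F:
            try:
                ch = bytes([ord(ch)]).decode("cp1252")
            except UnicodeDecodeError:
                # A few positions (e.g. 0x81) are undefined in cp1252;
                # leave those characters untouched.
                pass
        out.append(ch)
    return "".join(out)


raw = b"gum\x97and"                  # \x97 is octal \227
mislabelled = raw.decode("latin-1")  # yields U+0097, a control character
# Re-encoding as UTF-8 produces the two bytes \302\227 (hex C2 97).
assert mislabelled.encode("utf-8") == b"gum\xc2\x97and"
print(repair_c1_controls(mislabelled))
```

This maps to real Unicode characters instead of ASCII approximations,
and it leaves ordinary Latin-1 text (anything outside the C1 range)
alone.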

-- 
(domestic pets only, the antidote for overdose, milk.)
  larsi@gnus.org * Lars Magne Ingebrigtsen





* Re: Invalid characters, or something else?
  2010-10-17 13:17     ` Lars Magne Ingebrigtsen
@ 2010-10-17 13:22       ` Lars Magne Ingebrigtsen
  2010-10-17 13:34         ` Lars Magne Ingebrigtsen
  2010-10-17 21:33       ` James Cloos
  1 sibling, 1 reply; 12+ messages in thread
From: Lars Magne Ingebrigtsen @ 2010-10-17 13:22 UTC (permalink / raw)
  To: ding

Lars Magne Ingebrigtsen <larsi@gnus.org> writes:

> So this could probably be fixed pretty easily on the Gwene side, but it
> should map the smartquotes to Unicode characters, not ASCII.  Anybody
> have such a table?

Or even better, a mapping from these characters to ascii &symbols;.  But
wikipedia has a table, so I could just map them to &#2013;, I guess?
Yeah, that should work.

-- 
(domestic pets only, the antidote for overdose, milk.)
  larsi@gnus.org * Lars Magne Ingebrigtsen





* Re: Invalid characters, or something else?
  2010-10-17 13:22       ` Lars Magne Ingebrigtsen
@ 2010-10-17 13:34         ` Lars Magne Ingebrigtsen
  2010-10-17 17:57           ` Lars Magne Ingebrigtsen
  2010-10-17 21:34           ` James Cloos
  0 siblings, 2 replies; 12+ messages in thread
From: Lars Magne Ingebrigtsen @ 2010-10-17 13:34 UTC (permalink / raw)
  To: ding

Here's the table, if anybody out on google needs a table to translate
between Windows-1252 and HTML Unicode entities:

("128", "&#20AC;"),
("130", "&#201A;"),
("131", "&#0192;"),
("132", "&#201E;"),
("133", "&#2026;"),
("134", "&#2020;"),
("135", "&#2021;"),
("136", "&#02C6;"),
("137", "&#2030;"),
("138", "&#0160;"),
("139", "&#2039;"),
("140", "&#0152;"),
("142", "&#017D;"),
("145", "&#2018;"),
("146", "&#2019;"),
("147", "&#201C;"),
("148", "&#201D;"),
("149", "&#2022;"),
("150", "&#2013;"),
("151", "&#2014;"),
("152", "&#02DC;"),
("153", "&#2122;"),
("154", "&#0161;"),
("155", "&#203A;"),
("156", "&#0153;"),
("158", "&#017E;"),
("159", "&#0178;"),
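
A table like this can also be generated instead of typed in by hand.  The
sketch below assumes Python's built-in cp1252 codec as the source of
truth and writes the entities as hexadecimal character references, which
take an "x" after the "&#".  The function name cp1252_entity_table is
just something picked for the example.

```python
def cp1252_entity_table():
    """Return (decimal byte, hex HTML entity) pairs for bytes 128..159.

    Hypothetical helper; relies on Python's cp1252 codec for the mapping.
    """
    table = []
    for byte in range(0x80, 0xA0):
        try:
            ch = bytes([byte]).decode("cp1252")
        except UnicodeDecodeError:
            # 129, 141, 143, 144 and 157 are unassigned in Windows-1252.
            continue
        table.append((byte, "&#x%04X;" % ord(ch)))
    return table


for byte, entity in cp1252_entity_table():
    print('("%d", "%s"),' % (byte, entity))
```

The five skipped bytes are the same ones missing from the list above.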

-- 
(domestic pets only, the antidote for overdose, milk.)
  larsi@gnus.org * Lars Magne Ingebrigtsen





* Re: Invalid characters, or something else?
  2010-10-17 13:34         ` Lars Magne Ingebrigtsen
@ 2010-10-17 17:57           ` Lars Magne Ingebrigtsen
  2010-10-17 21:34           ` James Cloos
  1 sibling, 0 replies; 12+ messages in thread
From: Lars Magne Ingebrigtsen @ 2010-10-17 17:57 UTC (permalink / raw)
  To: ding

Lars Magne Ingebrigtsen <larsi@gnus.org> writes:

> Here's the table, if anybody out on google needs a table to translate
> between Windows-1252 and HTML Unicode entities:
>
> ("128", "&#20AC;"),

That should be "&#x20AC;", I guess.

And then there's stuff like this:

&#151;

Which is even more annoying, because it just ends up like this: \227
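
For what it's worth, HTML parsers that follow the HTML5 rules treat
numeric references in the 128-159 range as Windows-1252 rather than as
C1 controls.  As an illustration only (this is Python's standard html
module, not anything Gnus or Gwene is known to use):

```python
import html

# html.unescape() applies the HTML5 rules, under which a numeric
# reference like &#151; is read as the Windows-1252 character at that
# position (an em dash, U+2014), not as the control character U+0097.
decoded = html.unescape("chewing gum&#151;and suggest")
assert decoded == "chewing gum\u2014and suggest"
print(decoded)
```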

-- 
(domestic pets only, the antidote for overdose, milk.)
  larsi@gnus.org * Lars Magne Ingebrigtsen





* Re: Invalid characters, or something else?
  2010-10-17  0:20 ` Russ Allbery
  2010-10-17 13:07   ` Lars Magne Ingebrigtsen
@ 2010-10-17 21:30   ` Kevin Ryde
  1 sibling, 0 replies; 12+ messages in thread
From: Kevin Ryde @ 2010-10-17 21:30 UTC (permalink / raw)
  To: ding

Russ Allbery <rra@stanford.edu> writes:
>
> unlabelled Windows code pages.

Or labelled as <?xml encoding="iso-8859-1"> but in fact Windows-1252 :-(.
I suppose if you assume there won't be any 8-bit control characters (the
C1 range), then you could decode as 1252 instead automatically.  The blame
belongs squarely with the feed generator of course and it seems bad to
pander to the worst ones, even if it works out in practice, especially
as bad feeds are each bad in their own different way :-).  For what it's
worth in my own program I chucked in a charset override option and left
it at that (having also struck some latin-1 vs utf-8 feed
mis-declarations ...).
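
A rough sketch of that heuristic plus an override option, in Python for
illustration.  The function decode_feed and its parameters are invented
for this example and are not taken from any real feed reader.

```python
from typing import Optional


def decode_feed(data: bytes, declared: str,
                override: Optional[str] = None) -> str:
    """Decode feed bytes, treating a Latin-1 label as Windows-1252.

    Hypothetical helper.  `override` lets the user force a charset for
    a feed whose declaration is known to be wrong.
    """
    if override is not None:
        return data.decode(override, errors="replace")
    if declared.strip().lower() in ("iso-8859-1", "latin-1", "latin1"):
        try:
            # The two charsets differ only in 0x80..0x9F, so this is
            # harmless unless the feed really contains C1 controls.
            return data.decode("cp1252")
        except UnicodeDecodeError:
            pass  # hit a byte that is undefined in cp1252; fall through
    return data.decode(declared, errors="replace")


print(decode_feed(b"chewing gum\x97and suggest", "iso-8859-1"))
```

The override is there because, as noted, bad feeds are each bad in
their own way and no single guess covers them all.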




* Re: Invalid characters, or something else?
  2010-10-17 13:17     ` Lars Magne Ingebrigtsen
  2010-10-17 13:22       ` Lars Magne Ingebrigtsen
@ 2010-10-17 21:33       ` James Cloos
  2010-10-18  6:11         ` Reiner Steib
  2010-10-18 18:32         ` Lars Magne Ingebrigtsen
  1 sibling, 2 replies; 12+ messages in thread
From: James Cloos @ 2010-10-17 21:33 UTC (permalink / raw)
  To: ding

>>>>> "LMI" == Lars Magne Ingebrigtsen <larsi@gnus.org> writes:

LMI> Gnus does have this already:
LMI> (defvar gnus-article-dumbquotes-map

FWIW, the dumbquotes-map hasn't worked for me since I started using
the unicode branch (which was the first big change merged into trunk
for Emacs-24).

-JimC
-- 
James Cloos <cloos@jhcloos.com>         OpenPGP: 1024D/ED7DAEA6




* Re: Invalid characters, or something else?
  2010-10-17 13:34         ` Lars Magne Ingebrigtsen
  2010-10-17 17:57           ` Lars Magne Ingebrigtsen
@ 2010-10-17 21:34           ` James Cloos
  1 sibling, 0 replies; 12+ messages in thread
From: James Cloos @ 2010-10-17 21:34 UTC (permalink / raw)
  To: ding

>>>>> "LMI" == Lars Magne Ingebrigtsen <larsi@gnus.org> writes:

LMI> Here's the table, if anybody out on google needs a table to
LMI> translate between Windows-1252 and HTML Unicode entities:

The definitive package is Markus Kuhn's uniset:

    http://www.cl.cam.ac.uk/~mgk25/download/uniset.tar.gz

-JimC
-- 
James Cloos <cloos@jhcloos.com>         OpenPGP: 1024D/ED7DAEA6




* Re: Invalid characters, or something else?
  2010-10-17 21:33       ` James Cloos
@ 2010-10-18  6:11         ` Reiner Steib
  2010-10-18 18:32         ` Lars Magne Ingebrigtsen
  1 sibling, 0 replies; 12+ messages in thread
From: Reiner Steib @ 2010-10-18  6:11 UTC (permalink / raw)
  To: ding

On Sun, Oct 17 2010, James Cloos wrote:

> FWIW, the dumbquotes-map hasn't worked for me since I started using
> the unicode branch (which was the first big change merged into trunk
> for Emacs-24).

Emacs 23.

Bye, Reiner.
-- 
       ,,,
      (o o)
---ooO-(_)-Ooo---  |  PGP key available  |  http://rsteib.home.pages.de/




* Re: Invalid characters, or something else?
  2010-10-17 21:33       ` James Cloos
  2010-10-18  6:11         ` Reiner Steib
@ 2010-10-18 18:32         ` Lars Magne Ingebrigtsen
  1 sibling, 0 replies; 12+ messages in thread
From: Lars Magne Ingebrigtsen @ 2010-10-18 18:32 UTC (permalink / raw)
  To: ding

James Cloos <cloos@jhcloos.com> writes:

> FWIW, the dumbquotes-map hasn't worked for me since I started using
> the unicode branch (which was the first big change merged into trunk
> for Emacs-24).

Didn't work for me, either.  Fixed now.

-- 
(domestic pets only, the antidote for overdose, milk.)
  larsi@gnus.org * Lars Magne Ingebrigtsen




