Gnus development mailing list
* Have Emacs guess the charset?
@ 2001-06-15 14:01 Kai Großjohann
  2001-08-19 20:41 ` Lars Magne Ingebrigtsen
  0 siblings, 1 reply; 16+ messages in thread
From: Kai Großjohann @ 2001-06-15 14:01 UTC (permalink / raw)


Sometimes I get email which has no predeclared charset.  Emacs assumes
Latin-1 in those cases.  This is good in general.  But is there a way
to have Emacs inspect the current message and suggest a better
charset?

In particular, I sometimes know there is Chinese in it, but I don't
know if it's GB or Big5 encoded.  So I try both until I see a
character I recognize.  Is there a way to have Emacs/Gnus guess
whether it's GB or Big5?

Also, maybe I want to tell Emacs that the default charset for a
specific author is to be something else than Latin-1.  (I guess I can
do this on a per-group basis already.)

Suggestions?

kai
-- 
~/.signature: No such file or directory



* Re: Have Emacs guess the charset?
  2001-06-15 14:01 Have Emacs guess the charset? Kai Großjohann
@ 2001-08-19 20:41 ` Lars Magne Ingebrigtsen
  2001-08-19 23:46   ` Daniel Pittman
  0 siblings, 1 reply; 16+ messages in thread
From: Lars Magne Ingebrigtsen @ 2001-08-19 20:41 UTC (permalink / raw)


Kai.Grossjohann@CS.Uni-Dortmund.DE (Kai Großjohann) writes:

> Sometimes I get email which has no predeclared charset.  Emacs assumes
> Latin-1 in those cases.  This is good in general.  But is there a way
> to have Emacs inspect the current message and suggest a better
> charset?
>
> In particular, I sometimes know there is Chinese in it, but I don't
> know if it's GB or Big5 encoded.  So I try both until I see a
> character I recognize.  Is there a way to have Emacs/Gnus guess
> whether it's GB or Big5?

Surely there must be some Mule functions for guessing what charset
some text is in, but I have no idea what it's called.  Anybody?

-- 
(domestic pets only, the antidote for overdose, milk.)
   larsi@gnus.org * Lars Magne Ingebrigtsen



* Re: Have Emacs guess the charset?
  2001-08-19 20:41 ` Lars Magne Ingebrigtsen
@ 2001-08-19 23:46   ` Daniel Pittman
  2001-08-20  0:36     ` Lars Magne Ingebrigtsen
  2001-08-21  0:07     ` ShengHuo ZHU
  0 siblings, 2 replies; 16+ messages in thread
From: Daniel Pittman @ 2001-08-19 23:46 UTC (permalink / raw)


On Sun, 19 Aug 2001, Lars Magne Ingebrigtsen wrote:
> Kai.Grossjohann@CS.Uni-Dortmund.DE (Kai Großjohann) writes:
> 
>> Sometimes I get email which has no predeclared charset. Emacs assumes
>> Latin-1 in those cases. This is good in general. But is there a way
>> to have Emacs inspect the current message and suggest a better
>> charset?
>>
>> In particular, I sometimes know there is Chinese in it, but I don't
>> know if it's GB or Big5 encoded.  So I try both until I see a
>> character I recognize.  Is there a way to have Emacs/Gnus guess
>> whether it's GB or Big5?
> 
> Surely there must be some Mule functions for guessing what charset
> some text is in, but I have no idea what it's called.  Anybody?

`detect-coding-region'
        Daniel

-- 
A man can no more diminish God's glory by refusing to worship Him than a
lunatic can put out the sun by scribbling the word 'darkness' on the walls of
his cell. 
        -- C. S. Lewis, _The problem of pain_



* Re: Have Emacs guess the charset?
  2001-08-19 23:46   ` Daniel Pittman
@ 2001-08-20  0:36     ` Lars Magne Ingebrigtsen
  2001-08-20  1:26       ` Daniel Pittman
  2001-09-01 16:32       ` Dave Love
  2001-08-21  0:07     ` ShengHuo ZHU
  1 sibling, 2 replies; 16+ messages in thread
From: Lars Magne Ingebrigtsen @ 2001-08-20  0:36 UTC (permalink / raw)


Daniel Pittman <daniel@rimspace.net> writes:

> `detect-coding-region'

Thanks.

I've just tried it in an (unmarked) big5 message.  (Well, I'm guessing
it was big5.)  The function returned the following list:

(iso-latin-1-unix raw-text-unix chinese-big5-unix no-conversion)

Which means that the correct answer is the third-most-likely guess...
Is big5 particularly difficult to guess, or is the function bad at
guessing? 

-- 
(domestic pets only, the antidote for overdose, milk.)
   larsi@gnus.org * Lars Magne Ingebrigtsen



* Re: Have Emacs guess the charset?
  2001-08-20  0:36     ` Lars Magne Ingebrigtsen
@ 2001-08-20  1:26       ` Daniel Pittman
  2001-08-20  6:35         ` Lars Magne Ingebrigtsen
  2001-09-01 16:32       ` Dave Love
  1 sibling, 1 reply; 16+ messages in thread
From: Daniel Pittman @ 2001-08-20  1:26 UTC (permalink / raw)


On Mon, 20 Aug 2001, Lars Magne Ingebrigtsen wrote:
> Daniel Pittman <daniel@rimspace.net> writes:
> 
>> `detect-coding-region'
> 
> Thanks.
> 
> I've just tried it in an (unmarked) big5 message.  (Well, I'm guessing
> it was big5.)  The function returned the following list:
> 
> (iso-latin-1-unix raw-text-unix chinese-big5-unix no-conversion)
> 
> Which means that the correct answer is the third-most-likely guess...
> Is big5 particularly difficult to guess, or is the function bad at
> guessing?

I think that the answer is probably "both", but I am not really certain.
I don't know too much about MULE, but both Kai and I have tried to
support it at various times with TRAMP.

So, my recollection of BIG5 encoding is that it is an escaped-in set of
bytes in the 128-255 range, with iso-2022 codeset shifts to get to and
from ASCII.

That means it's probably not that easy to pick. Which, you understand,
does not make the function all that smart.
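
For reference, Big5 as usually specified is a plain two-byte encoding
with no ISO-2022 shifts: a lead byte with the high bit set, then a
trail byte in 0x40-0x7E or 0xA1-0xFE.  GB2312 in its usual EUC-CN form
uses 0xA1-0xFE for both bytes.  So a trail byte in 0x40-0x7E is a clue
that the text is Big5 rather than GB.  The following is a minimal
Python sketch of that single test, not what Emacs or XEmacs actually
does; the function name is made up for illustration.

```python
def has_big5_only_trail_byte(data: bytes) -> bool:
    """Return True if some two-byte character has a trail byte that
    Big5 allows but EUC-CN (GB2312) does not."""
    i = 0
    while i < len(data):
        if data[i] < 0x80:
            i += 1              # ASCII: one byte in both encodings
            continue
        if i + 1 >= len(data):
            break               # truncated trailing character
        if 0x40 <= data[i + 1] <= 0x7E:
            return True         # legal in Big5, impossible in EUC-CN
        i += 2                  # skip the whole two-byte character
    return False
```

A True result rules out EUC-CN; a False result is inconclusive, since
Big5 text may happen to use only trail bytes in the 0xA1-0xFE range.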

Under XEmacs, it's pretty simplistic in its detection of possible
coding system matches.

Er, you did call it on the region that *DID NOT* include the ASCII email
headers, right?

If that's true, I guess that you are pretty much short of luck. :(
        Daniel

-- 
There is censorship in this country, all right, make no mistake about that,
but also make no mistake about its source...While the government will not
censor, apparently the networks will. The irreparable damage to the public is
all the same.
        -- Nicholas Johnson, Federal Communications Commissioner,
           _New York Times_, (April 8, 1969)



* Re: Have Emacs guess the charset?
  2001-08-20  1:26       ` Daniel Pittman
@ 2001-08-20  6:35         ` Lars Magne Ingebrigtsen
  2001-08-20  7:26           ` Daniel Pittman
  0 siblings, 1 reply; 16+ messages in thread
From: Lars Magne Ingebrigtsen @ 2001-08-20  6:35 UTC (permalink / raw)


Daniel Pittman <daniel@rimspace.net> writes:

> Er, you did call it on the region that *DID NOT* include the ASCII email
> headers, right?

Yup.

> If that's true, I guess that you are pretty much short of luck. :(

Darn.

-- 
(domestic pets only, the antidote for overdose, milk.)
   larsi@gnus.org * Lars Magne Ingebrigtsen



* Re: Have Emacs guess the charset?
  2001-08-20  6:35         ` Lars Magne Ingebrigtsen
@ 2001-08-20  7:26           ` Daniel Pittman
  2001-08-20  8:25             ` Lars Magne Ingebrigtsen
  0 siblings, 1 reply; 16+ messages in thread
From: Daniel Pittman @ 2001-08-20  7:26 UTC (permalink / raw)


On Mon, 20 Aug 2001, Lars Magne Ingebrigtsen wrote:
> Daniel Pittman <daniel@rimspace.net> writes:
> 
>> Er, you did call it on the region that *DID NOT* include the ASCII
>> email headers, right?
> 
> Yup.
> 
>> If that's true, I guess that you are pretty much short of luck. :(
> 
> Darn.

Well, if someone were to forward the message to me, I could see if
XEmacs were any more clever than GNU Emacs. Better still, I could then
forward that on and ask the MULE hackers what they think about doing it.

There is also `detect-coding-with-priority' which takes a list of
priorities, at least under XEmacs.

You could supply a list of the various far-east encodings to that, which
/should/ prefer those to the western encodings. Not that the function is
likely to do much for the western encodings anyway.
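
The idea of a priority list can be sketched outside Emacs as follows.
This is a hedged Python illustration using Python's own codecs, not
the Mule implementation: it tries each candidate in the order given
and keeps the ones that decode without error.  The function name and
the choice of candidates are assumptions made for the example.

```python
def guess_charsets(data: bytes, priority: list) -> list:
    """Return the encodings from `priority`, in that order, under
    which `data` decodes without error."""
    candidates = []
    for name in priority:
        try:
            data.decode(name)   # strict decoding; result is discarded
        except UnicodeDecodeError:
            continue            # not valid in this encoding
        candidates.append(name)
    return candidates


# 0xA4 0x40 is valid Big5 but not valid GB2312 (EUC-CN).
print(guess_charsets(b"\xa4\x40", ["big5", "gb2312", "latin-1"]))
# prints ['big5', 'latin-1']
```

Latin-1 assigns a character to every byte, so it never fails and
would win every time if it came first; that is the reason to list the
far-east encodings ahead of it.  Text that is valid in both Big5 and
GB still comes back with both candidates, so the order only expresses
a preference, not a decision.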

        Daniel

-- 
An expert is a person who has made all the mistakes which can be made
in a very narrow field.
        -- Niels Bohr



* Re: Have Emacs guess the charset?
  2001-08-20  7:26           ` Daniel Pittman
@ 2001-08-20  8:25             ` Lars Magne Ingebrigtsen
  2001-08-20  9:13               ` Daniel Pittman
  2001-08-20  9:34               ` Kai Großjohann
  0 siblings, 2 replies; 16+ messages in thread
From: Lars Magne Ingebrigtsen @ 2001-08-20  8:25 UTC (permalink / raw)


Daniel Pittman <daniel@rimspace.net> writes:

> Well, if someone were to forward the message to me, I could see if
> XEmacs were any more clever than GNU Emacs. Better still, I could then
> forward that on and ask the MULE hackers what they think about doing it.

What -- you mean you don't get any spam in mangled big5?  Wow.

Anyway, I'll mail you one...

> You could supply a list of the various far-east encodings to that, which
> /should/ prefer those to the western encodings. Not that the function is
> likely to do much for the western encodings anyway.

But we don't know what encodings there are in the buffer.

But I guess we could look for clues -- for instance, if any of the
headers are encoded with big5 (and marked as such), we could use that
as a clue...

-- 
(domestic pets only, the antidote for overdose, milk.)
   larsi@gnus.org * Lars Magne Ingebrigtsen



* Re: Have Emacs guess the charset?
  2001-08-20  8:25             ` Lars Magne Ingebrigtsen
@ 2001-08-20  9:13               ` Daniel Pittman
  2001-08-20 10:02                 ` Lars Magne Ingebrigtsen
  2001-08-20  9:34               ` Kai Großjohann
  1 sibling, 1 reply; 16+ messages in thread
From: Daniel Pittman @ 2001-08-20  9:13 UTC (permalink / raw)


On Mon, 20 Aug 2001, Lars Magne Ingebrigtsen wrote:
> Daniel Pittman <daniel@rimspace.net> writes:
> 
>> Well, if someone were to forward the message to me, I could see if
>> XEmacs were any more clever than GNU Emacs. Better still, I could
>> then forward that on and ask the MULE hackers what they think about
>> doing it.
> 
> What -- you mean you don't get any spam in mangled big5?  Wow.

Heh. I don't /keep/ any spam mangled in big5, and I don't fancy waiting
long enough for some to arrive. 

> Anyway, I'll mail you one...

Cool.

>> You could supply a list of the various far-east encodings to that,
>> which /should/ prefer those to the western encodings. Not that the
>> function is likely to do much for the western encodings anyway.
> 
> But we don't know what encodings there are in the buffer.
> 
> But I guess we could look for clues -- for instance, if any of the
> headers are encoded with big5 (and marked as such), we could use that
> as a clue...

That could be it. Otherwise, you could try just giving a set of likely
encodings such as big5, which will not be autodetected in any standard
western character set.

Not that this is such a great idea... at least, not unless it's
customizable by the user.

             Daniel

-- 
Christianity has not been tried and found wanting;
it has been found difficult and not tried.
        -- Gilbert K. Chesterton



* Re: Have Emacs guess the charset?
  2001-08-20  8:25             ` Lars Magne Ingebrigtsen
  2001-08-20  9:13               ` Daniel Pittman
@ 2001-08-20  9:34               ` Kai Großjohann
  2001-08-20 10:02                 ` Lars Magne Ingebrigtsen
  1 sibling, 1 reply; 16+ messages in thread
From: Kai Großjohann @ 2001-08-20  9:34 UTC (permalink / raw)


Lars Magne Ingebrigtsen <larsi@gnus.org> writes:

> Daniel Pittman <daniel@rimspace.net> writes:
>
>> You could supply a list of the various far-east encodings to that, which
>> /should/ prefer those to the western encodings. Not that the function is
>> likely to do much for the western encodings anyway.
>
> But we don't know what encodings there are in the buffer.

Well, as a first step it is enough for me to be able to say: "I think
this is Chinese, what does it say?"  This is better than having to ask
"Show it to me in GB -- does it make sense?"

One little problem is that I don't speak or read Chinese and both
showing it in GB and showing it in Big5 leads to apparently meaningful
Chinese characters -- just different ones.  So unless you know what
characters to look for, it's not easy to say which one of the two is
right.

But on the other hand, if a human can't tell, maybe Emacs can't tell,
either.  And on the third hand, Emacs might know more Chinese than I
do...

kai
-- 
Symbol's function definition is void: signature



* Re: Have Emacs guess the charset?
  2001-08-20  9:13               ` Daniel Pittman
@ 2001-08-20 10:02                 ` Lars Magne Ingebrigtsen
  0 siblings, 0 replies; 16+ messages in thread
From: Lars Magne Ingebrigtsen @ 2001-08-20 10:02 UTC (permalink / raw)


Daniel Pittman <daniel@rimspace.net> writes:

> Not that this is such a great idea... at least, not unless it's
> customizable by the user.

It's just meant to be used as the default value on the prompt for
`0 g'.

-- 
(domestic pets only, the antidote for overdose, milk.)
   larsi@gnus.org * Lars Magne Ingebrigtsen



* Re: Have Emacs guess the charset?
  2001-08-20  9:34               ` Kai Großjohann
@ 2001-08-20 10:02                 ` Lars Magne Ingebrigtsen
  0 siblings, 0 replies; 16+ messages in thread
From: Lars Magne Ingebrigtsen @ 2001-08-20 10:02 UTC (permalink / raw)


Kai.Grossjohann@CS.Uni-Dortmund.DE (Kai Großjohann) writes:

> Well, as a first step it is enough for me to be able to say: "I think
> this is Chinese, what does it say?"  This is better than having to ask
> "Show it to me in GB -- does it make sense?"

But we don't have a mapping of "chinese" -> likely encodings, do we?

-- 
(domestic pets only, the antidote for overdose, milk.)
   larsi@gnus.org * Lars Magne Ingebrigtsen



* Re: Have Emacs guess the charset?
  2001-08-19 23:46   ` Daniel Pittman
  2001-08-20  0:36     ` Lars Magne Ingebrigtsen
@ 2001-08-21  0:07     ` ShengHuo ZHU
  2001-08-21 21:58       ` Lars Magne Ingebrigtsen
  2001-09-01 16:32       ` Dave Love
  1 sibling, 2 replies; 16+ messages in thread
From: ShengHuo ZHU @ 2001-08-21  0:07 UTC (permalink / raw)


[-- Attachment #1: Type: text/plain, Size: 974 bytes --]

Daniel Pittman <daniel@rimspace.net> writes:

> On Sun, 19 Aug 2001, Lars Magne Ingebrigtsen wrote:
> > Kai.Grossjohann@CS.Uni-Dortmund.DE (Kai Großjohann) writes:
>> 
>>> Sometimes I get email which has no predeclared charset. Emacs assumes
>>> Latin-1 in those cases. This is good in general. But is there a way
>>> to have Emacs inspect the current message and suggest a better
>>> charset?
>>>
>>> In particular, I sometimes know there is Chinese in it, but I don't
>>> know if it's GB or Big5 encoded.  So I try both until I see a
>>> character I recognize.  Is there a way to have Emacs/Gnus guess
>>> whether it's GB or Big5?
>> 
>> Surely there must be some Mule functions for guessing what charset
>> some text is in, but I have no idea what it's called.  Anybody?
>
> `detect-coding-region'

I tested the function on the attached files in both Emacs 21 and
XEmacs 21.4.  I found the one in XEmacs did a good job, but the one in
Emacs is almost useless.

ShengHuo


[-- Attachment #2: Big5.tex --]
[-- Type: application/octet-stream, Size: 565 bytes --]

% This is the file Big5.tex of the CJK package
%   for testing Chinese (in Big 5 encoding).
%
% written by Werner Lemberg <wl@gnu.org>
%
% Version 4.2.0 (13-Dec-1998)
%
%
% process this file with bg5latex

\documentclass[12pt]{article} 

\usepackage{CJK}


\begin{document}

\begin{CJK*}{Bg5}{song}
\CJKtilde

\noindent ¥»±`°Ý°Ýµª¶°~(FAQ list)~¬O±q¤@¨Ç¸g±`³Q°Ý¨ìªº°ÝÃD¤Î¨ä¾A·íªº¸Ñ
µª¤¤¡A¥H¤è«Kªº§Î¦¡ºK­n¦Ó¥Xªº¡C¸ò¤W¤@ª©¤£¦Pªº¬O¡A¨ä½s±Æµ²ºc¤w¹ý©³§ïÅÜ¡C
\textbf{¦³Ãö·sµ²ºcªº²Ó¸`¡A¥i°Ñ¦Ò¡u¦p¦ó¾\Ū¥»°Ýµª¶°¤Î¤F¸Ñ¨ä½s±Æµ²ºc¡v¸Ó
¶µ¤¤ªº»¡©ú¡C}

\end{CJK*}

\end{document}

[-- Attachment #3: GB.tex --]
[-- Type: application/octet-stream, Size: 520 bytes --]

% This is the file GB.tex of the CJK package
%   for testing Chinese (in GB encoding).
%
% written by Werner Lemberg <wl@gnu.org>
%
% Version 4.2.0 (13-Dec-1998)

\documentclass[12pt]{article}

\usepackage{CJK}


\begin{document}

\begin{CJK*}{GB}{song}
\CJKtilde

\noindent ±¾³£ÎÊÎÊ´ð¼¯~(FAQ list)~ÊÇ´ÓһЩ¾­³£±»Îʵ½µÄÎÊÌâ¼°ÆäÊʵ±µÄ½â
´ðÖУ¬ÒÔ·½±ãµÄÐÎʽժҪ¶ø³öµÄ¡£¸úÉÏÒ»°æ²»Í¬µÄÊÇ£¬Æä±àÅŽṹÒѳ¹µ×¸Ä±ä¡£
\textbf{ÓйØнṹµÄϸ½Ú£¬¿É²Î¿¼¡¸ÈçºÎÔĶÁ±¾Îʴ𼯼°Á˽âÆä±àÅŽṹ¡¹¸Ã
ÏîÖеÄ˵Ã÷¡£}

\end{CJK*}

\end{document}


* Re: Have Emacs guess the charset?
  2001-08-21  0:07     ` ShengHuo ZHU
@ 2001-08-21 21:58       ` Lars Magne Ingebrigtsen
  2001-09-01 16:32       ` Dave Love
  1 sibling, 0 replies; 16+ messages in thread
From: Lars Magne Ingebrigtsen @ 2001-08-21 21:58 UTC (permalink / raw)


ShengHuo ZHU <zsh@cs.rochester.edu> writes:

> I tested the function on the attached files in both Emacs 21 and
> XEmacs 21.4.  I found the one in XEmacs did a good job, but the one in
> Emacs is almost useless.

Right -- that would explain what I'm seeing.  Well, I'll just leave
the prompting code as it is in `gnus-summary-show-article' -- it'll
provide a better default for XEmacs users, and it won't hurt Emacs
users any.

-- 
(domestic pets only, the antidote for overdose, milk.)
   larsi@gnus.org * Lars Magne Ingebrigtsen



* Re: Have Emacs guess the charset?
  2001-08-21  0:07     ` ShengHuo ZHU
  2001-08-21 21:58       ` Lars Magne Ingebrigtsen
@ 2001-09-01 16:32       ` Dave Love
  1 sibling, 0 replies; 16+ messages in thread
From: Dave Love @ 2001-09-01 16:32 UTC (permalink / raw)


>>>>> "ZSH" == ShengHuo ZHU <zsh@cs.rochester.edu> writes:

 >> `detect-coding-region'

 ZSH> I tested the function on the attached files in both Emacs 21 and
 ZSH> XEmacs 21.4.  I found the one in XEmacs did a good job, but the one in
 ZSH> Emacs is almost useless.

What does that mean?  The function detects the Big5 for me in Emacs.
It won't detect GB outside an appropriate language environment,
because there's no coding category registered for it.  If Emacs
doesn't detect an encoding with an appropriate entry in
`coding-category-list' it's a bug.

However, I think you're on a hiding to nothing with this, even if you
have the corresponding coding system with which to decode it.  (E.g.
_I_ have windows-1252, but most people don't, and you couldn't
typically distinguish it from other CCL coding systems which use most,
or all, octets.)







* Re: Have Emacs guess the charset?
  2001-08-20  0:36     ` Lars Magne Ingebrigtsen
  2001-08-20  1:26       ` Daniel Pittman
@ 2001-09-01 16:32       ` Dave Love
  1 sibling, 0 replies; 16+ messages in thread
From: Dave Love @ 2001-09-01 16:32 UTC (permalink / raw)


>>>>> "LMI" == Lars Magne Ingebrigtsen <larsi@gnus.org> writes:

 LMI> Which means that the correct answer is the third-most-likely
 LMI> guess...  Is big5 particularly difficult to guess, or is the
 LMI> function bad at guessing?

`Return a list of possible coding systems ordered by priority.'

M-x apropos priority


