* Have Emacs guess the charset?
From: Kai Großjohann @ 2001-06-15 14:01 UTC

Sometimes I get email which has no predeclared charset. Emacs assumes
Latin-1 in those cases. This is good in general. But is there a way
to have Emacs inspect the current message and suggest a better
charset?

In particular, I sometimes know there is Chinese in it, but I don't
know if it's GB or Big5 encoded. So I try both until I see a
character I recognize. Is there a way to have Emacs/Gnus guess
whether it's GB or Big5?

Also, maybe I want to tell Emacs that the default charset for a
specific author is to be something else than Latin-1. (I guess I can
do this on a per-group basis already.)

Suggestions?

kai
--
~/.signature: No such file or directory

* Re: Have Emacs guess the charset?
From: Lars Magne Ingebrigtsen @ 2001-08-19 20:41 UTC

Kai.Grossjohann@CS.Uni-Dortmund.DE (Kai Großjohann) writes:

> Sometimes I get email which has no predeclared charset. Emacs assumes
> Latin-1 in those cases. This is good in general. But is there a way
> to have Emacs inspect the current message and suggest a better
> charset?
>
> In particular, I sometimes know there is Chinese in it, but I don't
> know if it's GB or Big5 encoded. So I try both until I see a
> character I recognize. Is there a way to have Emacs/Gnus guess
> whether it's GB or Big5?

Surely there must be some Mule functions for guessing what charset
some text is in, but I have no idea what it's called. Anybody?

--
(domestic pets only, the antidote for overdose, milk.)
   larsi@gnus.org * Lars Magne Ingebrigtsen

* Re: Have Emacs guess the charset?
From: Daniel Pittman @ 2001-08-19 23:46 UTC

On Sun, 19 Aug 2001, Lars Magne Ingebrigtsen wrote:
> Kai.Grossjohann@CS.Uni-Dortmund.DE (Kai Großjohann) writes:
>
>> Sometimes I get email which has no predeclared charset. Emacs assumes
>> Latin-1 in those cases. This is good in general. But is there a way
>> to have Emacs inspect the current message and suggest a better
>> charset?
>>
>> In particular, I sometimes know there is Chinese in it, but I don't
>> know if it's GB or Big5 encoded. So I try both until I see a
>> character I recognize. Is there a way to have Emacs/Gnus guess
>> whether it's GB or Big5?
>
> Surely there must be some Mule functions for guessing what charset
> some text is in, but I have no idea what it's called. Anybody?

`detect-coding-region'

        Daniel

--
A man can no more diminish God's glory by refusing to worship Him than a
lunatic can put out the sun by scribbling the word 'darkness' on the
walls of his cell.
        -- C. S. Lewis, _The problem of pain_
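
For readers following along, here is a minimal sketch of how
`detect-coding-region' might be called on just the body of a message.
It assumes the current buffer holds the raw, not yet decoded message
(headers, a blank line, then the body); `my-guess-body-coding' is a
hypothetical helper name and is not part of Emacs or Gnus.

```elisp
;; Sketch only: `my-guess-body-coding' is a hypothetical helper.
(defun my-guess-body-coding ()
  "Return the coding systems Emacs considers possible for the body.
The body is assumed to start after the first blank line of the
current buffer; if there is no blank line, the whole buffer is used."
  (save-excursion
    (goto-char (point-min))
    ;; Skip the ASCII headers.  With NOERROR set to t, point stays at
    ;; the start of the buffer when no blank line is found.
    (search-forward "\n\n" nil t)
    ;; Without the optional HIGHEST argument the result is a list of
    ;; candidate coding systems ordered by priority.
    (detect-coding-region (point) (point-max))))
```

The first element of the returned list is the top-priority candidate,
which is not necessarily the right one.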

* Re: Have Emacs guess the charset?
From: Lars Magne Ingebrigtsen @ 2001-08-20 0:36 UTC

Daniel Pittman <daniel@rimspace.net> writes:

> `detect-coding-region'

Thanks.

I've just tried it in an (unmarked) big5 message. (Well, I'm guessing
it was big5.) The function returned the following list:

(iso-latin-1-unix raw-text-unix chinese-big5-unix no-conversion)

Which means that the correct answer is the third-most-likely guess...
Is big5 particularly difficult to guess, or is the function bad at
guessing?

--
(domestic pets only, the antidote for overdose, milk.)
   larsi@gnus.org * Lars Magne Ingebrigtsen

* Re: Have Emacs guess the charset?
From: Daniel Pittman @ 2001-08-20 1:26 UTC

On Mon, 20 Aug 2001, Lars Magne Ingebrigtsen wrote:
> Daniel Pittman <daniel@rimspace.net> writes:
>
>> `detect-coding-region'
>
> Thanks.
>
> I've just tried it in an (unmarked) big5 message. (Well, I'm guessing
> it was big5.) The function returned the following list:
>
> (iso-latin-1-unix raw-text-unix chinese-big5-unix no-conversion)
>
> Which means that the correct answer is the third-most-likely guess...
> Is big5 particularly difficult to guess, or is the function bad at
> guessing?

I think that the answer is probably "both", but I am not really certain.
I don't know too much about MULE, but both Kai and I have tried to
support it at various times with TRAMP.

So, my recollection of BIG5 encoding is that it is a two-byte encoding
whose first byte is in the 128-255 range and whose second byte can be
either an ASCII-range or a high-range value, with no iso-2022 escape
sequences to mark the switch to and from ASCII. That means it's
probably not that easy to pick.

Which, you understand, does not make the function all that smart. Under
XEmacs, it's pretty simplistic in its detection of possible coding
system matches.

Er, you did call it on the region that *DID NOT* include the ASCII email
headers, right?

If that's true, I guess that you are pretty much short of luck. :(

        Daniel

--
There is censorship in this country, all right, make no mistake about
that, but also make no mistake about its source...While the government
will not censor, apparently the networks will. The irreparable damage to
the public is all the same.
        -- Nicholas Johnson, Federal Communications Commissioner,
           _New York Times_, (April 8, 1969)

* Re: Have Emacs guess the charset?
From: Lars Magne Ingebrigtsen @ 2001-08-20 6:35 UTC

Daniel Pittman <daniel@rimspace.net> writes:

> Er, you did call it on the region that *DID NOT* include the ASCII email
> headers, right?

Yup.

> If that's true, I guess that you are pretty much short of luck. :(

Darn.

--
(domestic pets only, the antidote for overdose, milk.)
   larsi@gnus.org * Lars Magne Ingebrigtsen

* Re: Have Emacs guess the charset?
From: Daniel Pittman @ 2001-08-20 7:26 UTC

On Mon, 20 Aug 2001, Lars Magne Ingebrigtsen wrote:
> Daniel Pittman <daniel@rimspace.net> writes:
>
>> Er, you did call it on the region that *DID NOT* include the ASCII
>> email headers, right?
>
> Yup.
>
>> If that's true, I guess that you are pretty much short of luck. :(
>
> Darn.

Well, if someone were to forward the message to me, I could see if
XEmacs were any more clever than GNU Emacs. Better still, I could then
forward that on and ask the MULE hackers what they think about doing it.

There is also `detect-coding-with-priority' which takes a list of
priorities, at least under XEmacs.

You could supply a list of the various far-east encodings to that, which
/should/ prefer those to the western encodings. Not that the function is
likely to do much for the western encodings anyway.

        Daniel

--
An expert is a person who has made all the mistakes which can be made in
a very narrow field.
        -- Niels Bohr
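
The idea of ranking the far-east encodings ahead of the western ones
can be sketched as follows. This assumes GNU Emacs 23 or later, where
the `with-coding-priority' macro temporarily moves the given coding
systems to the front of the detection priority; it is used here
instead of `detect-coding-with-priority', whose argument format is not
spelled out in the thread. `my-guess-cjk-coding' and its candidate
list are hypothetical.

```elisp
(require 'mule-util)  ; provides `with-coding-priority'

;; Sketch only: `my-guess-cjk-coding' is a hypothetical helper.
(defun my-guess-cjk-coding (start end)
  "Guess the coding system of the text between START and END.
Big5 and GB2312 (`chinese-iso-8bit' in Emacs) are tried first.
The candidate list is an assumption and should be user-customizable."
  (with-coding-priority '(chinese-big5 chinese-iso-8bit)
    ;; A non-nil third argument asks for the single best guess
    ;; instead of the full list of candidates.
    (detect-coding-region start end t)))
```

This only changes the order in which candidates are preferred; it does
not make the detector any better at telling Big5 from GB.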

* Re: Have Emacs guess the charset?
From: Lars Magne Ingebrigtsen @ 2001-08-20 8:25 UTC

Daniel Pittman <daniel@rimspace.net> writes:

> Well, if someone were to forward the message to me, I could see if
> XEmacs were any more clever than GNU Emacs. Better still, I could then
> forward that on and ask the MULE hackers what they think about doing it.

What -- you mean you don't get any spam in mangled big5? Wow.

Anyway, I'll mail you one...

> You could supply a list of the various far-east encodings to that, which
> /should/ prefer those to the western encodings. Not that the function is
> likely to do much for the western encodings anyway.

But we don't know what encodings there are in the buffer.

But I guess we could look for clues -- for instance, if any of the
headers are encoded with big5 (and marked as such), we could use that
as a clue...

--
(domestic pets only, the antidote for overdose, milk.)
   larsi@gnus.org * Lars Magne Ingebrigtsen

* Re: Have Emacs guess the charset?
From: Daniel Pittman @ 2001-08-20 9:13 UTC

On Mon, 20 Aug 2001, Lars Magne Ingebrigtsen wrote:
> Daniel Pittman <daniel@rimspace.net> writes:
>
>> Well, if someone were to forward the message to me, I could see if
>> XEmacs were any more clever than GNU Emacs. Better still, I could
>> then forward that on and ask the MULE hackers what they think about
>> doing it.
>
> What -- you mean you don't get any spam in mangled big5? Wow.

Heh. I don't /keep/ any spam mangled in big5, and I don't fancy waiting
long enough for some to arrive.

> Anyway, I'll mail you one...

Cool.

>> You could supply a list of the various far-east encodings to that,
>> which /should/ prefer those to the western encodings. Not that the
>> function is likely to do much for the western encodings anyway.
>
> But we don't know what encodings there are in the buffer.
>
> But I guess we could look for clues -- for instance, if any of the
> headers are encoded with big5 (and marked as such), we could use that
> as a clue...

That could be it. Otherwise, you could try just giving a set of likely
encodings such as big5, which will not be autodetected in any standard
western character set.

Not that this is such a great idea... at least, not unless it's
customizable by the user.

        Daniel

--
Christianity has not been tried and found wanting; it has been found
difficult and not tried.
        -- Gilbert K. Chesterton

* Re: Have Emacs guess the charset?
From: Lars Magne Ingebrigtsen @ 2001-08-20 10:02 UTC

Daniel Pittman <daniel@rimspace.net> writes:

> Not that this is such a great idea... at least, not unless it's
> customizable by the user.

It's just meant to be used as the default value on the prompt for
`0 g'.

--
(domestic pets only, the antidote for overdose, milk.)
   larsi@gnus.org * Lars Magne Ingebrigtsen

* Re: Have Emacs guess the charset?
From: Kai Großjohann @ 2001-08-20 9:34 UTC

Lars Magne Ingebrigtsen <larsi@gnus.org> writes:

> Daniel Pittman <daniel@rimspace.net> writes:
>
>> You could supply a list of the various far-east encodings to that, which
>> /should/ prefer those to the western encodings. Not that the function is
>> likely to do much for the western encodings anyway.
>
> But we don't know what encodings there are in the buffer.

Well, as a first step it is enough for me to be able to say: "I think
this is Chinese, what does it say?" This is better than having to ask
"Show it to me in GB -- does it make sense?"

One little problem is that I don't speak or read Chinese and both
showing it in GB and showing it in Big5 leads to apparently meaningful
Chinese characters -- just different ones. So unless you know what
characters to look for, it's not easy to say which one of the two is
right.

But on the other hand, if a human can't tell, maybe Emacs can't tell,
either. And on the third hand, Emacs might know more Chinese than I
do...

kai
--
Symbol's function definition is void: signature
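
The trial-and-error comparison described above can be sketched in a
few lines: decode the same raw bytes once as Big5 and once as GB, and
look at the two results side by side. `my-decode-both-ways' is a
hypothetical helper; `chinese-big5' and `chinese-iso-8bit' are the GNU
Emacs names for Big5 and GB2312, and other versions may differ.

```elisp
;; Sketch only: `my-decode-both-ways' is a hypothetical helper.
(defun my-decode-both-ways (raw)
  "Return an alist mapping each candidate coding system to decoded text.
RAW is a unibyte string holding the undecoded message body."
  (mapcar (lambda (coding)
            (cons coding (decode-coding-string raw coding)))
          '(chinese-big5 chinese-iso-8bit)))

;; Usage: bytes that really are Big5, decoded under both assumptions.
;; (my-decode-both-ways (encode-coding-string "中文" 'chinese-big5))
```

Both entries will often look like plausible Chinese, which is the
difficulty: decoding without an error does not show that the guess was
right.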

* Re: Have Emacs guess the charset?
From: Lars Magne Ingebrigtsen @ 2001-08-20 10:02 UTC

Kai.Grossjohann@CS.Uni-Dortmund.DE (Kai Großjohann) writes:

> Well, as a first step it is enough for me to be able to say: "I think
> this is Chinese, what does it say?" This is better than having to ask
> "Show it to me in GB -- does it make sense?"

But we don't have a mapping of "chinese" -> likely encodings, do we?

--
(domestic pets only, the antidote for overdose, milk.)
   larsi@gnus.org * Lars Magne Ingebrigtsen
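
On the question of a mapping from "chinese" to likely encodings: GNU
Emacs language environments each carry an ordered list of coding
systems, readable with `get-language-info'. A minimal sketch, assuming
the environment names "Chinese-BIG5" and "Chinese-GB" used by GNU
Emacs; `my-chinese-coding-candidates' is a hypothetical helper, and
XEmacs may organize this differently.

```elisp
;; Sketch only: `my-chinese-coding-candidates' is a hypothetical helper.
(defun my-chinese-coding-candidates ()
  "Return coding systems that the Chinese language environments prefer.
The result merges the `coding-priority' entries of the two
environments, with duplicates removed."
  (delete-dups
   (apply #'append
          (mapcar (lambda (env)
                    ;; Copy each list so the language info tables
                    ;; themselves are never modified.
                    (copy-sequence
                     (get-language-info env 'coding-priority)))
                  '("Chinese-BIG5" "Chinese-GB")))))
```

A list like this could serve as the candidate set offered at the
charset prompt, leaving the final choice to the user.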

* Re: Have Emacs guess the charset?
From: Dave Love @ 2001-09-01 16:32 UTC

>>>>> "LMI" == Lars Magne Ingebrigtsen <larsi@gnus.org> writes:

 LMI> Which means that the correct answer is the third-most-likely
 LMI> guess... Is big5 particularly difficult to guess, or is the
 LMI> function bad at guessing?

`Return a list of possible coding systems ordered by priority.'

M-x apropos priority

* Re: Have Emacs guess the charset?
From: ShengHuo ZHU @ 2001-08-21 0:07 UTC

[-- Attachment #1: Type: text/plain, Size: 974 bytes --]

Daniel Pittman <daniel@rimspace.net> writes:

> On Sun, 19 Aug 2001, Lars Magne Ingebrigtsen wrote:
>> Kai.Grossjohann@CS.Uni-Dortmund.DE (Kai Großjohann) writes:
>>
>>> Sometimes I get email which has no predeclared charset. Emacs assumes
>>> Latin-1 in those cases. This is good in general. But is there a way
>>> to have Emacs inspect the current message and suggest a better
>>> charset?
>>>
>>> In particular, I sometimes know there is Chinese in it, but I don't
>>> know if it's GB or Big5 encoded. So I try both until I see a
>>> character I recognize. Is there a way to have Emacs/Gnus guess
>>> whether it's GB or Big5?
>>
>> Surely there must be some Mule functions for guessing what charset
>> some text is in, but I have no idea what it's called. Anybody?
>
> `detect-coding-region'

I tested the function on the attached files in both Emacs 21 and
XEmacs 21.4. I found the one in XEmacs did a good job, but the one in
Emacs is almost useless.

ShengHuo

[-- Attachment #2: Big5.tex --]
[-- Type: application/octet-stream, Size: 565 bytes --]

% This is the file Big5.tex of the CJK package
% for testing Chinese (in Big 5 encoding).
%
% written by Werner Lemberg <wl@gnu.org>
%
% Version 4.2.0 (13-Dec-1998)
%
%
% process this file with bg5latex

\documentclass[12pt]{article}
\usepackage{CJK}

\begin{document}

\begin{CJK*}{Bg5}{song}
\CJKtilde

\noindent ¥»±`°Ý°Ýµª¶°~(FAQ list)~¬O±q¤@¨Ç¸g±`³Q°Ý¨ìªº°ÝÃD¤Î¨ä¾A·íªº¸Ñ
µª¤¤¡A¥H¤è«Kªº§Î¦¡ºKn¦Ó¥Xªº¡C¸ò¤W¤@ª©¤£¦Pªº¬O¡A¨ä½s±Æµ²ºc¤w¹ý©³§ïÅÜ¡C
\textbf{¦³Ãö·sµ²ºcªº²Ó¸`¡A¥i°Ñ¦Ò¡u¦p¦ó¾\Ū¥»°Ýµª¶°¤Î¤F¸Ñ¨ä½s±Æµ²ºc¡v¸Ó
¶µ¤¤ªº»¡©ú¡C}

\end{CJK*}

\end{document}

[-- Attachment #3: GB.tex --]
[-- Type: application/octet-stream, Size: 520 bytes --]

% This is the file GB.tex of the CJK package
% for testing Chinese (in GB encoding).
%
% written by Werner Lemberg <wl@gnu.org>
%
% Version 4.2.0 (13-Dec-1998)

\documentclass[12pt]{article}
\usepackage{CJK}

\begin{document}

\begin{CJK*}{GB}{song}
\CJKtilde

\noindent ±¾³£ÎÊÎÊ´ð¼¯~(FAQ list)~ÊÇ´ÓһЩ¾³£±»Îʵ½µÄÎÊÌâ¼°ÆäÊʵ±µÄ½â
´ðÖУ¬ÒÔ·½±ãµÄÐÎʽժҪ¶ø³öµÄ¡£¸úÉÏÒ»°æ²»Í¬µÄÊÇ£¬Æä±àÅŽṹÒѳ¹µ×¸Ä±ä¡£
\textbf{ÓйØнṹµÄϸ½Ú£¬¿É²Î¿¼¡¸ÈçºÎÔĶÁ±¾Îʴ𼯼°Á˽âÆä±àÅŽṹ¡¹¸Ã
ÏîÖеÄ˵Ã÷¡£}

\end{CJK*}

\end{document}

* Re: Have Emacs guess the charset?
From: Lars Magne Ingebrigtsen @ 2001-08-21 21:58 UTC

ShengHuo ZHU <zsh@cs.rochester.edu> writes:

> I tested the function on the attached files in both Emacs 21 and
> XEmacs 21.4. I found the one in XEmacs did a good job, but the one in
> Emacs is almost useless.

Right -- that would explain what I'm seeing.

Well, I'll just leave the prompting code as it is in
`gnus-summary-show-article' -- it'll provide a better default for
XEmacs users, and it won't hurt Emacs users any.

--
(domestic pets only, the antidote for overdose, milk.)
   larsi@gnus.org * Lars Magne Ingebrigtsen

* Re: Have Emacs guess the charset?
From: Dave Love @ 2001-09-01 16:32 UTC

>>>>> "ZSH" == ShengHuo ZHU <zsh@cs.rochester.edu> writes:

 >> `detect-coding-region'

 ZSH> I tested the function on the attached files in both Emacs 21 and
 ZSH> XEmacs 21.4. I found the one in XEmacs did a good job, but the one in
 ZSH> Emacs is almost useless.

What does that mean? The function detects the Big5 for me in Emacs.
It won't detect GB outside an appropriate language environment,
because there's no coding category registered for it.

If Emacs doesn't detect an encoding with an appropriate entry in
`coding-category-list' it's a bug. However, I think you're on a
hiding to nothing with this, even if you have the corresponding coding
system with which to decode it. (E.g. _I_ have windows-1252, but most
people don't, and you couldn't typically distinguish it from other CCL
coding systems which use most, or all, octets.)

Thread overview: 16+ messages (end of thread: 2001-09-01 16:32 UTC)

2001-06-15 14:01 Have Emacs guess the charset? Kai Großjohann
2001-08-19 20:41 ` Lars Magne Ingebrigtsen
2001-08-19 23:46   ` Daniel Pittman
2001-08-20  0:36     ` Lars Magne Ingebrigtsen
2001-08-20  1:26       ` Daniel Pittman
2001-08-20  6:35         ` Lars Magne Ingebrigtsen
2001-08-20  7:26           ` Daniel Pittman
2001-08-20  8:25             ` Lars Magne Ingebrigtsen
2001-08-20  9:13               ` Daniel Pittman
2001-08-20 10:02                 ` Lars Magne Ingebrigtsen
2001-08-20  9:34               ` Kai Großjohann
2001-08-20 10:02                 ` Lars Magne Ingebrigtsen
2001-09-01 16:32       ` Dave Love
2001-08-21  0:07     ` ShengHuo ZHU
2001-08-21 21:58       ` Lars Magne Ingebrigtsen
2001-09-01 16:32       ` Dave Love