Avoiding double encoding in subject

Gnus development mailing list

* Avoiding double encoding in subject
@ 2006-07-26 11:00 Reiner Steib
  2006-07-27 15:55 ` Michael Piotrowski
  0 siblings, 1 reply; 4+ messages in thread
From: Reiner Steib @ 2006-07-26 11:00 UTC


Hi,

when replying to an article whose subject contains an unknown (or
invalid) encoding...

| Newsgroups: gmane.test
| Archived-At: <http://permalink.gmane.org/gmane.test/3051>
| Message-ID: <v94px7cjcs.fsf@marauder.physik.uni-ulm.de>
| Subject: bogus or unknown charset in =?iso-8859-17?Q?=E4?= subject

-- i.e. the charset is unknown to (X)Emacs[1] -- and not present in
`mm-charset-synonym-alist', Gnus produces a subject like...

| Newsgroups: gmane.test
| Archived-At: <http://permalink.gmane.org/gmane.test/3052>
| Message-ID: <v9y7ujb39h.fsf@marauder.physik.uni-ulm.de>
| Subject: Re: bogus or unknown charset in =?us-ascii?Q?=3D=3Fiso-8859-17=3F?=
|  =?us-ascii?Q?Q=3F=3DE4=3F=3D?= subject

Bad.
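
For illustration, the double encoding can be reproduced outside of Gnus.
The following is a minimal Python sketch, not Gnus code: the helper
`q_encode_word' and its set of unescaped characters are assumptions made
for this example.  It treats the undecoded word as plain text and
Q-encodes it as us-ascii, which escapes every "=" and "?":

```python
import string

# Characters left unescaped in this sketch (a conservative subset of
# what RFC 2047 allows inside a Q-encoded word).
SAFE = set(string.ascii_letters + string.digits + "!*+-/")


def q_encode_word(text, charset="us-ascii"):
    """Wrap TEXT in a single RFC 2047 Q-encoded word (no line folding)."""
    pieces = []
    for byte in text.encode(charset):
        char = chr(byte)
        if char in SAFE:
            pieces.append(char)
        elif char == " ":
            pieces.append("_")  # Q encoding writes a space as "_"
        else:
            pieces.append("=%02X" % byte)  # everything else becomes =XX
    return "=?%s?Q?%s?=" % (charset, "".join(pieces))


# The encoded word that could not be decoded is handled as literal text.
undecoded = "=?iso-8859-17?Q?=E4?="
print(q_encode_word(undecoded))
```

This prints =?us-ascii?Q?=3D=3Fiso-8859-17=3FQ=3F=3DE4=3F=3D?=, i.e. the
same payload as in the Subject above, except that the sketch does not
split it into two encoded words to respect the line length limit.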

I'm not sure how Gnus should handle this situation.  Some
possibilities:

(1) Gnus could (probably, I don't know if it is feasible to implement
    this) mark the Subject as "not decoded" and resend it "as is"
    without the double[2] us-ascii encoding.  Gnus also has to make
    sure that this mark survives when the article is saved to the
    drafts folder.

    Problem: If the given charset is really invalid rather than
    unknown (the user usually can't decide), Gnus will also produce an
    incorrect article.

(2) Gnus' decoder could replace the unknown/invalid characters with a
    replacement character ("?", U+FFFD = REPLACEMENT CHARACTER, ...).

    Problem: It's probably not possible to get the number of
    replacement characters right.
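
To illustrate the problem noted under (2): the number of octets per
character depends on the charset, so the octet count of an undecodable
word says little about how many characters it contains.  A small Python
sketch (the sample word and the three charsets are chosen only for
illustration):

```python
# The same 11-character word needs a different number of octets
# depending on the charset used to encode it.
text = "w\u00f6chentlich"

octet_counts = {
    charset: len(text.encode(charset))
    for charset in ("iso-8859-1", "utf-8", "utf-16-le")
}

for charset, count in octet_counts.items():
    print(charset, "->", len(text), "characters,", count, "octets")
```

Without knowing the charset, a decoder cannot tell whether 12 octets
stand for 12, 11, or 6 characters.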

Other suggestions?

Bye, Reiner.

[1] This may happen when (X)Emacs is too old to support a newly
    introduced charset.
-- 
       ,,,
      (o o)
---ooO-(_)-Ooo---  |  PGP key available  |  http://rsteib.home.pages.de/





* Re: Avoiding double encoding in subject
  2006-07-26 11:00 Avoiding double encoding in subject Reiner Steib
@ 2006-07-27 15:55 ` Michael Piotrowski
  2006-11-06 20:00   ` Reiner Steib
  0 siblings, 1 reply; 4+ messages in thread
From: Michael Piotrowski @ 2006-07-27 15:55 UTC

On 2006-07-26 Reiner Steib <reinersteib+gmane@imap.cc> wrote:

> when replying to an article whose subject contains an unknown (or
> invalid) encoding...

[...]

> I'm not sure how Gnus should handle this situation.  Some
> possibilities:

[...]

> (2) Gnus' decoder could replace the unknown/invalid characters with a
>     replacement character ("?", U+FFFD = REPLACEMENT CHARACTER, ...).
>
>     Problem: It's probably not possible to get the number of
>     replacement characters right.

Something like option (2) plus user interaction might be the right
approach:

Gnus detects an RFC 2047-encoded string in a header.  If the charset is
known, everything is fine and it's decoded.  If the charset is
unknown, ask the user, e.g.:

  Unknown charset "iso-8859-17" in Subject header, what now? (c, u or C-h):

The user could then either type "c" and specify a charset which should
be used to interpret it or type "u" to accept it as unknown; in this
case I'd simply replace each octet with the replacement character
since there is no way to find out the character size of an unknown
encoding.
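
The proposal can be sketched roughly as follows.  This is Python rather
than Emacs Lisp and is not how Gnus implements anything; the function
name `decode_subject' and the `user_charset' parameter (standing in for
the answer to the "c" prompt) are assumptions for this example.  Known
charsets are decoded normally; for an unknown charset, each octet becomes
one U+FFFD unless the user supplied a charset:

```python
import base64
import codecs
import quopri
import re

# One RFC 2047 encoded word: =?charset?encoding?payload?=
ENCODED_WORD = re.compile(r"=\?([^?\s]+)\?([QqBb])\?([^?\s]*)\?=")


def decode_subject(raw, user_charset=None, marker="\ufffd"):
    """Decode encoded words in RAW, tolerating unknown charsets."""

    def replace_word(match):
        charset, encoding, payload = match.groups()
        if encoding.upper() == "Q":
            octets = quopri.decodestring(payload.encode("ascii"), header=True)
        else:
            octets = base64.b64decode(payload)
        try:
            codecs.lookup(charset)
        except LookupError:
            if user_charset is None:
                # "u": the character size is unknown, so mark every octet.
                return marker * len(octets)
            charset = user_charset  # "c": the user named a charset
        return octets.decode(charset, errors="replace")

    return ENCODED_WORD.sub(replace_word, raw)


print(decode_subject("bogus or unknown charset in =?iso-8859-17?Q?=E4?= subject"))
```

Passing user_charset="iso-8859-1" for the same input would decode the
word as "ä" instead of a replacement character.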

Greetings

-- 
Michael Piotrowski, M.A.                               <mxp@dynalabs.de>
Public key at <http://www.dynalabs.de/mxp/pubkey.txt>


* Re: Avoiding double encoding in subject
  2006-07-27 15:55 ` Michael Piotrowski
@ 2006-11-06 20:00   ` Reiner Steib
  2006-11-09 18:13     ` Reiner Steib
  0 siblings, 1 reply; 4+ messages in thread
From: Reiner Steib @ 2006-11-06 20:00 UTC


[ Quotation added because I am replying to an old message ]

On Thu, Jul 27 2006, Michael Piotrowski wrote:

> On 2006-07-26 Reiner Steib <reinersteib+gmane@imap.cc> wrote:
>
>> when replying to an article whose subject contains an unknown (or
>> invalid) encoding...
>> 
>> | Newsgroups: gmane.test
>> | Archived-At: <http://permalink.gmane.org/gmane.test/3051>
>> | Message-ID: <v94px7cjcs.fsf@marauder.physik.uni-ulm.de>
>> | Subject: bogus or unknown charset in =?iso-8859-17?Q?=E4?= subject
>> 
>> -- i.e. the charset is unknown to (X)Emacs[1] -- and not present in
>> `mm-charset-synonym-alist', Gnus produces a subject like...
>> 
>> | Newsgroups: gmane.test
>> | Archived-At: <http://permalink.gmane.org/gmane.test/3052>
>> | Message-ID: <v9y7ujb39h.fsf@marauder.physik.uni-ulm.de>
>> | Subject: Re: bogus or unknown charset in =?us-ascii?Q?=3D=3Fiso-8859-17=3F?=
>> |  =?us-ascii?Q?Q=3F=3DE4=3F=3D?= subject
>> 
>> Bad.
>> 
>> I'm not sure how Gnus should handle this situation.  Some
>> possibilities:
>
>> (1) Gnus could (probably, I don't know if it is feasible to implement
>>     this) mark the Subject as "not decoded" and resend it "as is"
>>     without the double[2] us-ascii encoding.  Gnus also has to make
>>     sure that this mark survives when the article is saved to the
>>     drafts folder.
>> 
>>     Problem: If the given charset is really invalid rather than
>>     unknown (the user usually can't decide), Gnus will also produce an
>>     incorrect article.
>> 
>> (2) Gnus' decoder could replace the unknown/invalid characters with a
>>     replacement character ("?", U+FFFD = REPLACEMENT CHARACTER, ...).
>>
>>     Problem: It's probably not possible to get the number of
>>     replacement characters right.

The fact that we only need to do something when the user replies to a
message makes it simpler.  We can tackle the problem when the user
hits `F', as we do when stripping "(was: ...)".

> Something like option (2) plus user interaction might be the right
> approach:
>
> Gnus detects an RFC 2047-encoded string in a header.  If the charset is
> known, everything is fine and it's decoded.  If the charset is
> unknown, ask the user, e.g.:
>
>   Unknown charset "iso-8859-17" in Subject header, what now? (c, u or C-h):
>
> The user could then either type "c" and specify a charset which should
> be used to interpret it or type "u" to accept it as unknown; in this
> case I'd simply replace each octet with the replacement character
> since there is no way to find out the character size of an unknown
> encoding.

I've added the new function `message-strip-subject-encoded-words' to
CVS trunk, but I haven't enabled it yet.  If nobody finds a problem, I'll
enable it by default tomorrow and will later merge it to v5-10 as
well.

I'd appreciate it if someone could look at the code, and I'd also like
to ask people to test it now by adding it to
`message-simplify-subject-functions'[1].

Bye, Reiner.

[1]
(add-to-list 'message-simplify-subject-functions
	     'message-strip-subject-encoded-words
	     t)
-- 
       ,,,
      (o o)
---ooO-(_)-Ooo---  |  PGP key available  |  http://rsteib.home.pages.de/





* Re: Avoiding double encoding in subject
  2006-11-06 20:00   ` Reiner Steib
@ 2006-11-09 18:13     ` Reiner Steib
  0 siblings, 0 replies; 4+ messages in thread
From: Reiner Steib @ 2006-11-09 18:13 UTC


On Mon, Nov 06 2006, Reiner Steib wrote:

> I've added the new function `message-strip-subject-encoded-words' to
> CVS trunk, but I haven't enabled it yet.  If nobody finds a problem, I'll
> enable it by default tomorrow and will later merge it to v5-10 as
> well.

Done.  I have found and fixed some problems.

See the threads starting with
<news:4adc91846507dfc3bbd2f23c7cc32a39@wachinger.fqdn.th-h.de> in
de.comm.software.newsreader and <news:eiheec$acn$1@online.de> in
de.comp.editoren for some examples in the wild.  (If you don't have
access to de.*, I could send interested testers an mbox file with these
threads.)

Or the thread starting with
<news:v94px7cjcs.fsf@marauder.physik.uni-ulm.de> in gmane.test
<http://thread.gmane.org/v94px7cjcs.fsf@marauder.physik.uni-ulm.de>
(BTW: Loom, Gmane's web interface, doesn't show the articles
correctly.)


`message-strip-subject-encoded-words' doesn't correct a similar
problem which was mentioned in
<news:v9irka2xuc.fsf_-_@marauder.physik.uni-ulm.de>:

The following encoding of "wöchentlich?" is wrong:

| Subject: wrong encoding: =?utf-8?Q?w=C3=B6chentlich??=

This is correct:

| Subject: wrong encoding: =?utf-8?Q?w=C3=B6chentlich=3F?=

Gnus doesn't detect the mistake, and the subject in a reply will be
double encoded as well (which is technically correct, but being a
little more "liberal in what you accept" might be okay):

,----
| Subject: Re: wrong encoding: =?us-ascii?Q?=3D=3Futf-8=3FQ=3Fw=3DC3=3DB6che?=
|  =?us-ascii?Q?ntlich=3F=3F=3D?=
`----
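
The difference between the two readings can be sketched in Python (again
not Gnus code; the patterns STRICT and LIBERAL and the helper
`decode_q_words' are assumptions for this example).  The strict pattern
rejects a bare "?" inside the encoded text, as RFC 2047 requires, so the
malformed word stays undecoded; the liberal pattern accepts any
non-whitespace run that ends in "?=" before whitespace or end of string:

```python
import quopri
import re

# Strict: the encoded text must not contain "?" or whitespace.
STRICT = re.compile(r"=\?([^?\s]+)\?[Qq]\?([^?\s]*)\?=")
# Liberal: allow a bare "?" as long as the word ends in "?=".
LIBERAL = re.compile(r"=\?([^?\s]+)\?[Qq]\?(\S*?)\?=(?=\s|$)")


def decode_q_words(subject, pattern):
    """Decode every Q-encoded word in SUBJECT that PATTERN recognizes."""

    def replace_word(match):
        charset, payload = match.groups()
        octets = quopri.decodestring(payload.encode("ascii"), header=True)
        return octets.decode(charset, errors="replace")

    return pattern.sub(replace_word, subject)


wrong = "wrong encoding: =?utf-8?Q?w=C3=B6chentlich??="
right = "wrong encoding: =?utf-8?Q?w=C3=B6chentlich=3F?="

print(decode_q_words(wrong, STRICT))   # left undecoded
print(decode_q_words(wrong, LIBERAL))  # decoded despite the bare "?"
print(decode_q_words(right, STRICT))   # decoded
```

A strict decoder that leaves the malformed word as literal text is what
leads to the double encoding in the reply; a liberal one would recover
"wöchentlich?" from both forms.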

Bye, Reiner.
-- 
       ,,,
      (o o)
---ooO-(_)-Ooo---  |  PGP key available  |  http://rsteib.home.pages.de/



