From mboxrd@z Thu Jan  1 00:00:00 1970
X-Msuck: nntp://news.gmane.io/gmane.emacs.gnus.general/60863
Path: news.gmane.org!not-for-mail
From: Hrvoje Niksic <hniksic@xemacs.org>
Newsgroups: gmane.emacs.gnus.general
Subject: CRLF canonicalization only done for text/plain
Date: Fri, 02 Sep 2005 01:35:48 +0200
Message-ID: <87aciws6nf.fsf@xemacs.org>
NNTP-Posting-Host: main.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
X-Trace: sea.gmane.org 1125619001 6906 80.91.229.2 (1 Sep 2005 23:56:41 GMT)
X-Complaints-To: usenet@sea.gmane.org
NNTP-Posting-Date: Thu, 1 Sep 2005 23:56:41 +0000 (UTC)
Cc: hniksic@xemacs.org
Original-X-From: ding-owner+m9395=ding+2daccount=gmane.org@lists.math.uh.edu Fri Sep 02 01:56:31 2005
Return-path: <ding-owner+m9395=ding+2daccount=gmane.org@lists.math.uh.edu>
Original-Received: from malifon.math.uh.edu ([129.7.128.13])
	by ciao.gmane.org with esmtp (Exim 4.43)
	id 1EAytm-0005Cx-U1
	for ding-account@gmane.org; Fri, 02 Sep 2005 01:55:15 +0200
Original-Received: from localhost
	([127.0.0.1] helo=lists.math.uh.edu ident=lists)
	by malifon.math.uh.edu with smtp (Exim 3.20 #1)
	id 1EAytl-00038C-01
	for ding-account@gmane.org; Thu, 01 Sep 2005 18:55:13 -0500
Original-Received: from nas01.math.uh.edu ([129.7.128.39])
	by malifon.math.uh.edu with esmtp (Exim 3.20 #1)
	id 1EAyb4-00036Y-00
	for ding@lists.math.uh.edu; Thu, 01 Sep 2005 18:35:54 -0500
Original-Received: from quimby.gnus.org ([80.91.224.244])
	by nas01.math.uh.edu with esmtp (Exim 4.52)
	id 1EAyb2-0003Ar-7x
	for ding@lists.math.uh.edu; Thu, 01 Sep 2005 18:35:54 -0500
Original-Received: from ls405.htnet.hr ([195.29.150.97])
	by quimby.gnus.org with esmtp (Exim 3.35 #1 (Debian))
	id 1EAyb0-00041n-00
	for <ding@gnus.org>; Fri, 02 Sep 2005 01:35:50 +0200
Original-Received: from ls422.t-com.hr (ls422.t-com.hr [195.29.150.237])
	by ls405.htnet.hr (0.0.0/8.12.10) with ESMTP id j81NZot8017833;
	Fri, 2 Sep 2005 01:35:50 +0200
Original-Received: from ls422.t-com.hr (localhost.localdomain [127.0.0.1])
	by ls422.t-com.hr (Qmlai) with ESMTP id 6D8C0988042;
	Fri,  2 Sep 2005 01:33:25 +0200 (CEST)
X-Envelope-Sender: hniksic@xemacs.org
Original-Received: from ls422.t-com.hr (localhost.localdomain [127.0.0.1])
	by ls422.t-com.hr (Qmlai) with ESMTP id 61CF098803F;
	Fri,  2 Sep 2005 01:33:25 +0200 (CEST)
Original-Received: from localhost.localdomain (83-131-67-249.adsl.net.t-com.hr [83.131.67.249])
	by ls422.t-com.hr (Qmlai) with ESMTP id 508508B803B;
	Fri,  2 Sep 2005 01:33:24 +0200 (CEST)
Original-Received: by localhost.localdomain (Postfix, from userid 1000)
	id 6F874380004; Fri,  2 Sep 2005 01:35:48 +0200 (CEST)
Original-To: ding@gnus.org
User-Agent: Gnus/5.1007 (Gnus v5.10.7) XEmacs/21.4.17 (Jumbo Shrimp, linux)
X-Spam-Score: 0.1 (/)
Precedence: bulk
Original-Sender: ding-owner@lists.math.uh.edu
Xref: news.gmane.org gmane.emacs.gnus.general:60863
Archived-At: <http://permalink.gmane.org/gmane.emacs.gnus.general/60863>

[ Please Cc responses to me because I'm not on the list. ]

Today I received mail with a Python script attached and, to my
surprise, discovered that Gnus saved it using CRLF for line endings.
Since it was sent from a Unix machine and I'm pretty sure the
source didn't contain CRLF, this puzzled me.

It turns out that the mail was sent by Evolution which prepared the
following attachment:

Content-Disposition: attachment; filename=rnditems
Content-Transfer-Encoding: base64
Content-Type: text/x-python; name=rnditems; charset=ISO-8859-2

IyEvdXNyL2Jpbi9weXRob24NCg0KaW1wb3J0IG9wdHBhcnNlLCBzeXMsIHJhbmRvbSwgc3RyaW5n
DQoNCmRlZiBzaGVsbF9xdW90ZShzdHIpOg0KICAgIHJldHVybiAiJyVzJyIgJSBzdHJpbmcucmVw
bGFjZShzdHIsICInIiwgIidcIidcIiciKQ0KDQpwYXJzZXIgPSBvcHRwYXJzZS5PcHRpb25QYXJz
ZXIodXNhZ2U9IiVwcm9nIFstbiBNSU5dIFstLXF1b3RlXSBJVEVNLi4uIikNCnBhcnNlci5hZGRf
b3B0aW9uKCctbicsICctLW51bWJlcicsIGRlc3Q9J251bWJlcicsIHR5cGU9J2ludCcsIGhlbHA9
J21heCBudW1iZXIgb2YgaXRlbXMgdG8gcmV0dXJuJykNCnBhcnNlci5hZGRfb3B0aW9uKCctcScs
ICctLXF1b3RlJywgZGVzdD0ncXVvdGUnLCBhY3Rpb249J3N0b3JlX3RydWUnLCBoZWxwPSdxdW90
ZSBwcmludGVkIGl0ZW1zJykNCm9wdGlvbiwgYXJncyA9IHBhcnNlci5wYXJzZV9hcmdzKCkNCg0K
aWYgb3B0aW9uLnF1b3RlOg0KICAgIGFyZ3MgPSBbc2hlbGxfcXVvdGUoeCkgZm9yIHggaW4gYXJn
c10NCg0KcmFuZG9tLnNodWZmbGUoYXJncykNCmZvciBpLCBpdGVtIGluIGVudW1lcmF0ZShhcmdz
KToNCiAgICBpZiBvcHRpb24ubnVtYmVyIGlzIG5vdCBOb25lIGFuZCBpID49IG9wdGlvbi5udW1i
ZXI6DQogICAgICAgIGJyZWFrDQogICAgcHJpbnQgaXRlbQ0K

Decoding the base64 shows that the text indeed contains CRLF pairs;
however, since the Content-Type is text/*, I think it is meant to be
the "canonical representation".  After decoding the base64, Gnus
should have converted that representation to the local line coding
convention (i.e. converted CRLF to LF and optionally let Mule handle
the actual conversion).  That Gnus didn't do this came as a surprise
because I was pretty sure that Gnus had the correct code to handle
this exact situation.
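
For illustration, the first line of the base64 body above can be
decoded with a few lines of Python.  This is only a sketch using the
standard base64 module; the variable names are made up for the
example, and it is not how Gnus performs the decoding.

```python
import base64

# First line of the base64 body quoted above.
first_line = (
    "IyEvdXNyL2Jpbi9weXRob24NCg0KaW1wb3J0IG9wdHBhcnNlLCBzeXMsIHJhbmRvbSwgc3RyaW5n"
)

decoded = base64.b64decode(first_line)

# Number of CRLF pairs in the decoded bytes.
crlf_count = decoded.count(b"\r\n")

# What a Unix reader would expect to end up with on disk.
local = decoded.replace(b"\r\n", b"\n")

print(repr(decoded))
print(crlf_count)
```

The decoded bytes start with "#!/usr/bin/python" followed by two
CRLF pairs, which is what ends up in the saved file when no
de-canonicalization is applied.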

A closer inspection of Gnus shows that it does contain CRLF
(de)-canonicalization code, but only for "text/plain" attachments,
whereas the above was "text/x-python":

(defun mm-decode-content-transfer-encoding (encoding &optional type)
  "Decodes buffer encoded with ENCODING, returning success status.
If TYPE is `text/plain' CRLF->LF translation may occur."
  ...
    (when (and
	   (memq encoding '(base64 x-uuencode x-uue x-binhex x-yenc))
	   (equal type "text/plain"))
      (goto-char (point-min))
      (while (search-forward "\r\n" nil t)
	(replace-match "\n" t t)))))

In other words, Evolution seems to think that "canonical
representation" of line breaks pertains to all text/* types, whereas
Gnus thinks that it pertains only to text/plain.  A reading of
RFC 2045, section 6.8 seems to indicate that Evolution is right:

   Care must be taken to use the proper octets for line breaks if
   base64 encoding is applied directly to text material that has not
   been converted to canonical form.  In particular, *text line breaks
   must be converted into CRLF sequences prior to base64 encoding*
   [emphasis mine].  The important thing to note is that this may be
   done directly by the encoder rather than in a prior
   canonicalization step in some implementations.

This talks about "text line breaks" in "text material", which can only
meaningfully refer to all text/* content types, at least unless
explicitly specified otherwise.  Applying it only to text/plain seems
wrong -- a useful feature of the TYPE/SUBTYPE division is that certain
properties can describe a type regardless of the subtypes.
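
The difference between the two interpretations comes down to the
predicate applied to the content type.  The following Python sketch
contrasts them; the function names are hypothetical and exist only
for this example.  The lower-casing reflects the fact that media type
names are case-insensitive, and the None check covers a part with no
declared type.

```python
def is_text_plain(content_type):
    """Exact match on text/plain (the narrower interpretation)."""
    return content_type == "text/plain"


def is_text_type(content_type):
    """Prefix match on text/ (the broader interpretation)."""
    return content_type is not None and content_type.lower().startswith("text/")


def decanonicalize(data, content_type, predicate):
    """Convert CRLF to LF in decoded bytes if predicate accepts the type."""
    if predicate(content_type):
        return data.replace(b"\r\n", b"\n")
    return data


sample = b"print 1\r\nprint 2\r\n"

# The exact-match predicate leaves a text/x-python part untouched.
print(repr(decanonicalize(sample, "text/x-python", is_text_plain)))

# The prefix predicate converts it to local (LF) line endings.
print(repr(decanonicalize(sample, "text/x-python", is_text_type)))
```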

RFC 2049 does single out text/plain in section 4, bullet 2.  However, I
don't think it intends to state that canonicalization (and the implied
decanonicalization) applies only to text/plain.  It says:

          [...] If character set conversion is involved, however, care
          must be taken to understand the semantics of the media type,
          which may have strong implications for any character set
          conversion, e.g. with regard to syntactically meaningful
          characters in a text subtype other than "plain".

That is, different character set conversion rules may apply to text
types other than text/plain -- but nothing is said of CRLF
conversions.  And then:

          For example, in the case of text/plain data, the text must
          be converted to a supported character set and lines must be
          delimited with CRLF delimiters in accordance with RFC 822.
          Note that the restriction on line lengths implied by RFC 822
          is eliminated if the next step employs either
          quoted-printable or base64 encoding.

This uses text/plain data as an *example* of how text data can be
treated, i.e. that it requires both charset conversion and CRLF
canonicalization.  But it doesn't imply that subtypes other than
"plain" don't need to undergo (de)canonicalization.
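
To make the intended round trip concrete, here is a small Python
sketch of canonicalization before base64 encoding and
de-canonicalization after decoding, applied to every text/* type.
The helper names are assumptions made for the example and do not
correspond to any Gnus function.

```python
import base64


def is_text(content_type):
    """True for any text/* type; None means no type was given."""
    return content_type is not None and content_type.startswith("text/")


def encode_part(data, content_type):
    """Base64-encode data, using CRLF line breaks for text types."""
    if is_text(content_type):
        # Normalize first so that existing CRLF pairs are not doubled.
        data = data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")
    return base64.b64encode(data)


def decode_part(encoded, content_type):
    """Decode base64 and restore LF line breaks for text types."""
    data = base64.b64decode(encoded)
    if is_text(content_type):
        data = data.replace(b"\r\n", b"\n")
    return data


script = b"#!/usr/bin/python\nprint 1\n"
wire = encode_part(script, "text/x-python")
print(decode_part(wire, "text/x-python") == script)
```

Non-text parts pass through both functions unchanged, so binary
attachments that happen to contain the bytes 0x0D 0x0A are not
affected.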


Hopefully the above is enough to convince you that Gnus does
not currently do the right thing.  Fortunately the change is simple
enough, implemented by applying the patch below.  Please let me know
if you agree with this change.

2005-09-02  Hrvoje Niksic  <hniksic@xemacs.org>

	* mm-bodies.el (mm-decode-content-transfer-encoding):
	De-canonicalize CRLF for all text content types, not just
	text/plain.

	* mm-encode.el (mm-encode-content-transfer-encoding): Likewise
	when encoding.

--- lisp/mm-bodies.el.orig	2005-09-02 00:46:57.000000000 +0200
+++ lisp/mm-bodies.el	2005-09-02 00:47:14.000000000 +0200
@@ -218,7 +218,7 @@
 	 nil))
     (when (and
 	   (memq encoding '(base64 x-uuencode x-uue x-binhex x-yenc))
-	   (equal type "text/plain"))
+	   (and type (string-match "\\`text/" type)))
       (goto-char (point-min))
       (while (search-forward "\r\n" nil t)
 	(replace-match "\n" t t)))))
--- lisp/mm-encode.el.orig	2005-09-02 01:07:17.000000000 +0200
+++ lisp/mm-encode.el	2005-09-02 01:07:20.000000000 +0200
@@ -106,7 +106,7 @@
     ;; Likewise base64 below.
     (quoted-printable-encode-region (point-min) (point-max) t))
    ((eq encoding 'base64)
-    (when (equal type "text/plain")
+    (when (and type (string-match "\\`text/" type))
       (goto-char (point-min))
       (while (search-forward "\n" nil t)
 	(replace-match "\r\n" t t)))