From: duerst@it.aoyama.ac.jp
Date: Sat, 26 Jun 2021 00:42:24 +0000 (UTC)
To: ruby-dev@ruby-lang.org
Subject: [ruby-dev:51072] [Ruby master Bug#12052] String#encode with xml option returns wrong result for totally non-ASCII-compatible encodings

Issue #12052 has been updated by duerst (Martin Dürst).


jeremyevans0 (Jeremy Evans) wrote in #note-3:

> It looks like this issue occurs when using both multibyte source and destination encoding. If either the source or destination encoding is not multibyte, the issue doesn't occur:
>
> ```ruby
> # Multibyte source, single-byte destination
> "<\0>\0".encode("utf-8", "utf-16le", xml: :text).bytes
> => [38, 108, 116, 59, 38, 103, 116, 59]
>
> # Single-byte source, multibyte destination
> "<>".encode("utf-16le", "utf-8", xml: :text).bytes
> => [38, 0, 108, 0, 116, 0, 59, 0, 38, 0, 103, 0, 116, 0, 59, 0]
>
> # Multibyte source, multibyte destination
> "<\0>\0".encode("utf-16le", "utf-16le", xml: :text).bytes
> => [38, 108, 116, 59, 0, 38, 103, 116, 59, 0]
> ```

True, except that the term "multibyte encoding" usually includes encodings such as UTF-8, whereas here we are speaking about encodings with code units longer than one byte.

But thinking about it, the problem may also affect encodings such as EBCDIC (IBM037), Shift_JIS, and ISO-2022-JP.
In the first case (IBM037), I get
```Ruby
"<>".encode("IBM037")
=> "\x4C\x6E"
"<>".encode("IBM037").encode("IBM037", xml: :text)
=> "\x4C\x6E"
"<>".encode("IBM037").force_encoding("US-ASCII")
=> "Ln"
```
This is easily explained: '<' and '>' are \x4C and \x6E in EBCDIC, but because the `xml: :text` processing runs in ASCII, these bytes are interpreted as 'L' and 'n' and left alone. Shift_JIS actually seems safe, because the characters to be converted are encoded as plain ASCII, and because they all fall into the range 0x20..0x3F, which isn't used as a second byte in Shift_JIS.

For ISO-2022-JP, we are not so lucky. Take the string `"<>湿"` (the Kanji stands for 'wet', which is appropriate here in Japan because we are in the Rainy Season now :-; it's there because in ISO-2022-JP, it is encoded with the same bytes as "<>", after switching with the necessary escape sequences):

```Ruby
"<>湿".encode("ISO-2022-JP")
=> "\x3C\x3E\e\x24\x42\x3C\x3E\e\x28\x42"
"<>湿".encode("ISO-2022-JP").force_encoding("US-ASCII")
=> "<>\e$B<>\e(B"
"<>湿".encode("ISO-2022-JP").encode("ISO-2022-JP", xml: :text)
=> "\x26\x6C\x74\x3B\x26\x67\x74\x3B\e\x24\x42\x26\x6C\x74\x3B\x26\x67\x74\x3B\e\x28\x42"
"<>湿".encode("ISO-2022-JP").encode("ISO-2022-JP", xml: :text).force_encoding("US-ASCII")
=> "&lt;&gt;\e$B&lt;&gt;\e(B"
```
Trying to further transcode the result of `encode("ISO-2022-JP", xml: :text)` leads to an encoding error.

> So a possible way to work around the issue until it can be properly fixed would be to detect the case where both source and destination are multibyte, switch the destination to UTF-8, then encode the result of that to the desired destination encoding.

The condition seems to be slightly narrower.
Even if both encodings have code units of more than one byte, things work as long as the encodings are not the same, most probably because these cases already get transcoded via UTF-8:
```Ruby
"<\0>\0".force_encoding("UTF-16LE").encode("UTF-16BE", xml: :text)
=> "&lt;&gt;"
"<\0>\0".force_encoding("UTF-16LE").encode("UTF-32BE", xml: :text)
=> "&lt;&gt;"
"<\0>\0".force_encoding("UTF-16LE").encode("UTF-32LE", xml: :text)
=> "&lt;&gt;"
"<\0>\0".force_encoding("UTF-16LE").encode("UTF-16LE", xml: :text)
=> "\u6C26\u3B74\u2600\u7467;"
```

I'll have a look at your patch later, but just wanted to get this out. Sorry to be quicker with encodings than with the actual code :-(.

----------------------------------------
Bug #12052: String#encode with xml option returns wrong result for totally non-ASCII-compatible encodings
https://bugs.ruby-lang.org/issues/12052#change-92654

* Author: nobu (Nobuyoshi Nakada)
* Status: Open
* Priority: Normal
* Assignee: akr (Akira Tanaka)
* Backport: 2.0.0: REQUIRED, 2.1: REQUIRED, 2.2: REQUIRED, 2.3: REQUIRED
----------------------------------------
Calling `String#encode` with the `xml:` option to convert from an ASCII-incompatible encoding to the same encoding returns a strange result.
It appears that the string is being converted as binary.

```ruby
p "<\0>\0".encode("utf-16le", "utf-16le", xml: :text)
#=> "\u6C26\u3B74\u2600\u7467;"
```



-- 
https://bugs.ruby-lang.org/
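
For readers who want to try the workaround quoted from #note-3, here is a minimal sketch. It simplifies that suggestion: instead of detecting the problematic case, it always does the `xml: :text` escaping with UTF-8 as the destination and then transcodes to the wanted encoding. The helper name `xml_text_via_utf8` is hypothetical and not part of Ruby. Only the UTF-16LE case, which matches the outputs reported in this thread, is exercised here.

```ruby
# Hypothetical helper (not part of Ruby): escape XML text with UTF-8 as the
# destination, so the xml: :text step works on ASCII-compatible bytes, then
# transcode the escaped string to the encoding the caller wants.
def xml_text_via_utf8(str, dest = str.encoding)
  escaped = str.encode(Encoding::UTF_8, xml: :text)
  escaped.encode(dest)
end

# Same input as in the bug report: "<>" stored as UTF-16LE.
src = "<\0>\0".dup.force_encoding(Encoding::UTF_16LE)
out = xml_text_via_utf8(src)

p out.encoding                 # => #<Encoding:UTF-16LE>
p out.encode(Encoding::UTF_8)  # => "&lt;&gt;"
```

The extra round trip through UTF-8 costs one more transcoding pass, which seems acceptable for a stopgap until `String#encode` handles the same-encoding case itself.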