ruby-dev (Japanese) list archive (unofficial mirror)
From: duerst@it.aoyama.ac.jp
To: ruby-dev@ruby-lang.org
Subject: [ruby-dev:51072] [Ruby master Bug#12052] String#encode with xml option returns wrong result for totally non-ASCII-compatible encodings
Date: Sat, 26 Jun 2021 00:42:24 +0000 (UTC)
Message-ID: <redmine.journal-92654.20210626004222.4@ruby-lang.org>
In-Reply-To: <redmine.issue-12052.20160205025027.4@ruby-lang.org>

Issue #12052 has been updated by duerst (Martin Dürst).


jeremyevans0 (Jeremy Evans) wrote in #note-3:

> It looks like this issue occurs when using both multibyte source and destination encoding.  If either the source or destination encoding is not multibyte, the issue doesn't occur:
> 
> ```ruby
> # Multibyte source, single-byte destination
> "<\0>\0".encode("utf-8", "utf-16le", xml: :text).bytes
> => [38, 108, 116, 59, 38, 103, 116, 59]
> 
> # Single-byte source, multibyte destination
> "<>".encode("utf-16le", "utf-8", xml: :text).bytes
> => [38, 0, 108, 0, 116, 0, 59, 0, 38, 0, 103, 0, 116, 0, 59, 0]
> 
> # Multibyte source, multibyte destination
> "<\0>\0".encode("utf-16le", "utf-16le", xml: :text).bytes
> => [38, 108, 116, 59, 0, 38, 103, 116, 59, 0]
> ``` 

True, except that usually the term "multibyte encoding" includes encodings such as UTF-8, and we are speaking here about encodings with code units longer than one byte.
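To make the distinction concrete, here is a small sketch (the helper name `bytes_per_ascii_char` is made up for illustration) that reports how many bytes each encoding needs for a single `<`, next to what `Encoding#ascii_compatible?` says:

```ruby
# Hypothetical helper: number of bytes an encoding uses for one ASCII character.
# For the UTF-16/UTF-32 family this equals the code unit size.
def bytes_per_ascii_char(encoding_name)
  "<".encode(encoding_name).bytesize
end

%w[UTF-8 Shift_JIS UTF-16LE UTF-32BE].each do |name|
  compatible = Encoding.find(name).ascii_compatible?
  puts "#{name}: #{bytes_per_ascii_char(name)} byte(s) for '<', ascii_compatible? = #{compatible}"
end
```

UTF-8 and Shift_JIS are multibyte encodings but still use one byte for `<`; UTF-16LE and UTF-32BE use two and four bytes.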

But thinking about it, it may also include encodings such as EBCDIC (IBM037), Shift_JIS, and ISO-2022-JP. For EBCDIC, I get
```Ruby
"<>".encode("IBM037")
=> "\x4C\x6E"
"<>".encode("IBM037").encode("IBM037", xml: :text)
=> "\x4C\x6E"
"<>".encode("IBM037").force_encoding("US-ASCII")
=> "Ln"
```
This is explained rather easily: '<' and '>' are \x4C and \x6E in EBCDIC, but because the `xml: :text` processing runs in ASCII, these are interpreted as 'L' and 'n' and left alone. Shift_JIS actually seems safe, because the characters to be converted are encoded as plain ASCII, and because they all fall into the range 0x20..0x3F, which isn't used as a second byte in Shift_JIS.
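Both observations can be checked with a short sketch. The variable names are illustrative only, and the Shift_JIS trail byte ranges (0x40..0x7E and 0x80..0xFC) are written out here as an assumption taken from the encoding's definition rather than queried from Ruby:

```ruby
# Bytes of "<>" in EBCDIC (IBM037): 0x4C 0x6E, as reported above.
ebcdic = "<>".encode("IBM037")

# Reinterpret the very same bytes as US-ASCII, which is what a byte-wise
# escaping step effectively sees.
as_ascii = ebcdic.bytes.pack("C*").force_encoding("US-ASCII")
p as_ascii  # "Ln"

# Characters that the xml: options may escape.
xml_specials = ["&", "<", ">", '"']

# Assumed Shift_JIS trail byte ranges (not obtained from Ruby itself).
sjis_trail_ranges = [0x40..0x7E, 0x80..0xFC]

# Does any special character share a byte value with a possible trail byte?
collides = xml_specials.any? do |ch|
  sjis_trail_ranges.any? { |range| range.cover?(ch.ord) }
end
p collides  # false
```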

For ISO-2022-JP, we are not so lucky. Take the string `"<>湿"`. The Kanji means 'wet', which is appropriate here in Japan because we are in the rainy season now :-; more to the point, in ISO-2022-JP it is encoded with the same two bytes as "<>" (0x3C 0x3E), once the necessary escape sequences have switched character sets:

```Ruby
"<>湿".encode("ISO-2022-JP")
=> "\x3C\x3E\e\x24\x42\x3C\x3E\e\x28\x42"
"<>湿".encode("ISO-2022-JP").force_encoding("US-ASCII")
=> "<>\e$B<>\e(B"
"<>湿".encode("ISO-2022-JP").encode("ISO-2022-JP", xml: :text)
=> "\x26\x6C\x74\x3B\x26\x67\x74\x3B\e\x24\x42\x26\x6C\x74\x3B\x26\x67\x74\x3B\e\x28\x42"
"<>湿".encode("ISO-2022-JP").encode("ISO-2022-JP", xml: :text).force_encoding("US-ASCII")
=> "&lt;&gt;\e$B&lt;&gt;\e(B"
```
Trying to further transcode the result of `encode("ISO-2022-JP", xml: :text)` leads to an encoding error. 
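To see why byte-wise escaping is unsafe here, the following sketch walks the ISO-2022-JP byte stream and counts bytes that look like XML special characters, separately for the ASCII state and the two-byte Kanji state. The method name `specials_by_mode` is hypothetical, and the scanner only knows the two escape sequences that appear in this example (`ESC $ B` and `ESC ( B`), so it is not a general ISO-2022-JP parser:

```ruby
# Hypothetical helper: count bytes equal to an XML special character,
# grouped by the shift state they occur in.
def specials_by_mode(str)
  data = str.b                      # work on raw bytes
  counts = { ascii: 0, kanji: 0 }
  mode = :ascii
  i = 0
  while i < data.bytesize
    if data[i, 3] == "\e$B".b       # switch to JIS X 0208 (two-byte) state
      mode = :kanji
      i += 3
    elsif data[i, 3] == "\e(B".b    # switch back to ASCII state
      mode = :ascii
      i += 3
    else
      counts[mode] += 1 if "<>&\"".b.include?(data[i])
      i += 1
    end
  end
  counts
end

# "\u6E7F" is the Kanji 湿 used above.
iso = "<>\u6E7F".encode("ISO-2022-JP")
p specials_by_mode(iso)
```

Two of the four matching bytes sit inside the Kanji state; those are the ones that must not be replaced.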

> So a possible way to work around the issue until it can be properly fixed would be to detect the case where both source and destination are multibyte, switch the destination to UTF-8, then encode the result of that to the desired destination encoding.

The condition seems to be slightly narrower. Even if both encodings have code units of more than one byte, things work as long as the two encodings differ, most probably because these cases already get transcoded via UTF-8:
```Ruby
"<\0>\0".force_encoding("UTF-16LE").encode("UTF-16BE", xml: :text)
=> "&lt;&gt;"
"<\0>\0".force_encoding("UTF-16LE").encode("UTF-32BE", xml: :text)
=> "&lt;&gt;"
"<\0>\0".force_encoding("UTF-16LE").encode("UTF-32LE", xml: :text)
=> "&lt;&gt;"
"<\0>\0".force_encoding("UTF-16LE").encode("UTF-16LE", xml: :text)
=> "\u6C26\u3B74\u2600\u7467;"
```
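Combining the workaround quoted above with this narrower condition, a minimal sketch might look like the following. `encode_xml_text` is a hypothetical helper, not part of Ruby, and using `ascii_compatible?` as the trigger is an assumption; whether that check also covers IBM037 depends on what `ascii_compatible?` returns for it, which is not verified here.

```ruby
# Hypothetical helper: when source and destination are the same
# ASCII-incompatible encoding, apply xml: :text while the data is in UTF-8,
# then convert back. Otherwise call String#encode directly.
def encode_xml_text(str, dest)
  dest = Encoding.find(dest) unless dest.is_a?(Encoding)
  if str.encoding == dest && !dest.ascii_compatible?
    str.encode(Encoding::UTF_8, xml: :text).encode(dest)
  else
    str.encode(dest, xml: :text)
  end
end

utf16 = "<>".encode("UTF-16LE")
escaped = encode_xml_text(utf16, "UTF-16LE")
p escaped.encoding            # UTF-16LE
p escaped.encode("UTF-8")     # "&lt;&gt;"
```

The detour costs two extra conversions, but it relies only on the UTF-16LE to UTF-8 path that is reported to work above.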

I'll have a look at your patch later, but just wanted to get this out. Sorry to be quicker with encodings than with the actual code :-(.

----------------------------------------
Bug #12052: String#encode with xml option returns wrong result for totally non-ASCII-compatible encodings
https://bugs.ruby-lang.org/issues/12052#change-92654

* Author: nobu (Nobuyoshi Nakada)
* Status: Open
* Priority: Normal
* Assignee: akr (Akira Tanaka)
* Backport: 2.0.0: REQUIRED, 2.1: REQUIRED, 2.2: REQUIRED, 2.3: REQUIRED
----------------------------------------
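Calling `String#encode` with the `xml:` option to convert from an ASCII-incompatible encoding to the same encoding returns a strange result.
It seems to be converting the string as if it were binary.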

```ruby
p "<\0>\0".encode("utf-16le", "utf-16le", xml: :text)
#=> "\u6C26\u3B74\u2600\u7467;"
```



-- 
https://bugs.ruby-lang.org/

