* [ruby-core:118388] [Ruby master Feature#20594] A new String method to append bytes while preserving encoding
@ 2024-06-25 6:39 byroot (Jean Boussier) via ruby-core
2024-06-25 15:33 ` [ruby-core:118389] " Eregon (Benoit Daloze) via ruby-core
` (23 more replies)
0 siblings, 24 replies; 25+ messages in thread
From: byroot (Jean Boussier) via ruby-core @ 2024-06-25 6:39 UTC (permalink / raw)
To: ruby-core; +Cc: byroot (Jean Boussier)
Issue #20594 has been reported by byroot (Jean Boussier).
----------------------------------------
Feature #20594: A new String method to append bytes while preserving encoding
https://bugs.ruby-lang.org/issues/20594
* Author: byroot (Jean Boussier)
* Status: Open
----------------------------------------
### Context
When working with binary protocols such as `protobuf` or `MessagePack`, you may often need to assemble multiple
strings of different encodings:
```ruby
Post = Struct.new(:title, :body) do
  def serialize(buf)
    buf <<
      255 << title.bytesize << title <<
      255 << body.bytesize << body
  end
end
Post.new("Hello", "World").serialize("somedata".b) # => "somedata\xFF\x05Hello\xFF\x05World" #<Encoding:ASCII-8BIT>
```
The problem in the above case is that, because `Encoding::ASCII_8BIT` is declared as ASCII compatible,
if one of the appended strings contains bytes outside the ASCII range, the string is automatically promoted
to another encoding, which then leads to encoding issues:
```ruby
Post.new("H€llo", "Wôrld").serialize("somedata".b) # => incompatible character encodings: ASCII-8BIT and UTF-8 (Encoding::CompatibilityError)
```
In many cases, you want to append to a String without changing the receiver's encoding.
The issue isn't exclusive to binary protocols and formats; it also happens with ASCII protocols that accept arbitrary bytes inline,
like Redis's RESP protocol or even HTTP/1.1.
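To make the promotion rule concrete, here is a small sketch of how `<<` behaves today depending on whether the binary receiver already holds a non-ASCII byte. The variable names are illustrative only, and `\u` escapes are used so the snippet doesn't depend on the source file's encoding.

```ruby
# An ASCII-only binary buffer is silently promoted to the argument's encoding.
ascii_buf = "abc".b
ascii_buf << "\u00E9" # UTF-8 "é"
promoted_encoding = ascii_buf.encoding # no longer ASCII-8BIT

# Once the buffer holds a byte >= 0x80, the same append raises instead.
binary_buf = "abc".b << 255
error = begin
  binary_buf << "\u00E9"
  nil
rescue Encoding::CompatibilityError => e
  e
end

puts promoted_encoding # UTF-8
puts error.class       # Encoding::CompatibilityError
```

Either outcome is a problem for a binary buffer: the first changes the receiver's encoding behind your back, the second fails outright.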
### Previous discussion
There was a similar feature request a while ago, but it was abandoned: https://bugs.ruby-lang.org/issues/14975
### Existing solutions
You can of course always cast the strings you append to avoid this problem:
```ruby
Post = Struct.new(:title, :body) do
  def serialize(buf)
    buf <<
      255 << title.bytesize << title.b <<
      255 << body.bytesize << body.b
  end
end
```
But this causes a lot of needless allocations.
You'd think you could also use `bytesplice`, but it actually has the same issue:
```ruby
Post = Struct.new(:title, :body) do
  def serialize(buf)
    buf << 255 << title.bytesize
    buf.bytesplice(buf.bytesize, title.bytesize, title)
    buf << 255 << body.bytesize
    buf.bytesplice(buf.bytesize, body.bytesize, body)
  end
end
Post.new("H€llo", "Wôrld").serialize("somedata".b) # => 'String#bytesplice': incompatible character encodings: BINARY (ASCII-8BIT) and UTF-8 (Encoding::CompatibilityError)
```
And even if it worked, it would be very unergonomic.
### Proposal: a `byteconcat` method
A solution to this would be to add a new `byteconcat` method, which could be shimmed as:
```ruby
class String
  def byteconcat(*strings)
    strings.map! do |s|
      if s.is_a?(String) && s.encoding != encoding
        s.dup.force_encoding(encoding)
      else
        s
      end
    end
    concat(*strings)
  end
end

Post = Struct.new(:title, :body) do
  def serialize(buf)
    buf.byteconcat(
      255, title.bytesize, title,
      255, body.bytesize, body,
    )
  end
end
Post.new("H€llo", "Wôrld").serialize("somedata".b) # => "somedata\xFF\aH\xE2\x82\xACllo\xFF\x06W\xC3\xB4rld" #<Encoding:ASCII-8BIT>
```
But of course a builtin implementation wouldn't need to dup the arguments.
Like other `byte*` methods, it's the responsibility of the caller to ensure the resulting string has a valid encoding, or
to deal with it if not.
### Method name and signature
#### Name
This proposal suggests `String#byteconcat`, to mirror `String#concat`, but other names are possible:
- `byteappend` (like `Array#append`)
- `bytepush` (like `Array#push`)
#### Signature
This proposal makes `byteconcat` accept either `String` or `Integer` (in char range) arguments like `concat`. I believe it makes sense for consistency and also because it's not uncommon for protocols to have some byte-based segments, and Integers are more convenient there.
The proposed method also accepts variable arguments for consistency with `String#concat`, `Array#push`, `Array#append`.
The proposed method returns self, like `concat` and others.
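For reference, the existing `String#concat` behaviour that this signature mirrors can be sketched as follows. It only uses methods that exist today; the variable names are illustrative.

```ruby
buf = "ab".b

# Integers and Strings can be mixed, and several arguments are accepted.
result = buf.concat(255, "cd", 0)

same_object = result.equal?(buf) # concat returns its receiver
bytes = buf.bytes

# On a UTF-8 receiver an Integer is treated as a codepoint, not a byte.
utf8 = String.new("x", encoding: Encoding::UTF_8)
utf8.concat(0x20AC) # U+20AC EURO SIGN, three bytes in UTF-8

p same_object   # true
p bytes         # [97, 98, 255, 99, 100, 0]
p utf8.bytesize # 4
```

The last case is why "in char range" matters: what an Integer argument means depends on the receiver's encoding.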
### YJIT consideration
I consulted @maximecb about this proposal, and according to her, accepting variable arguments makes it harder for YJIT to optimize.
I suspect consistency with other APIs trumps the performance consideration, but I think it's worth mentioning.
--
https://bugs.ruby-lang.org/
______________________________________________
ruby-core mailing list -- ruby-core@ml.ruby-lang.org
To unsubscribe send an email to ruby-core-leave@ml.ruby-lang.org
ruby-core info -- https://ml.ruby-lang.org/mailman3/lists/ruby-core.ml.ruby-lang.org/
* [ruby-core:118389] [Ruby master Feature#20594] A new String method to append bytes while preserving encoding
From: Eregon (Benoit Daloze) via ruby-core @ 2024-06-25 15:33 UTC (permalink / raw)
To: ruby-core; +Cc: Eregon (Benoit Daloze)
Issue #20594 has been updated by Eregon (Benoit Daloze).
+1 from me.
I forgot that `String#concat` accepted integers.
Together with accepting variable arguments, it feels like this new method would be fairly complicated to implement and optimize well.
But, it's consistent with `String#concat` so I think it makes sense as proposed.
----------------------------------------
Feature #20594: A new String method to append bytes while preserving encoding
https://bugs.ruby-lang.org/issues/20594#change-108905
* Author: byroot (Jean Boussier)
* Status: Open
----------------------------------------
* [ruby-core:118390] [Ruby master Feature#20594] A new String method to append bytes while preserving encoding
From: maximecb (Maxime Chevalier-Boisvert) via ruby-core @ 2024-06-26 15:51 UTC (permalink / raw)
To: ruby-core; +Cc: maximecb (Maxime Chevalier-Boisvert)
Issue #20594 has been updated by maximecb (Maxime Chevalier-Boisvert).
> I consulted @maximecb (Maxime Chevalier-Boisvert) about this proposal, and according to her, accepting variable arguments makes it harder for YJIT to optimize.
Yeah my personal preference would be for something like `bytepush` or `byteappend` that accepts a single argument, the simplest possible method, that is kept as strict as possible, while still getting the job done.
The technical reason being that in order to handle multiple arguments, we have to essentially unroll the loop in the YJIT codegen, and speculate that the arguments will all keep the same type at run-time. This is likely to be true, but it means that we have to speculate on many things at once and then generate a big piece of machine code all at the same time.
I understand that there is some temptation to keep the API more similar to other String methods with a similar purpose, but IMO this method is already unusual because it's essentially an `unsafe` operation (no encoding validation). The reason we're discussing this method in the first place is performance, and so maybe performance/simplicity should be the main concern.
If the only way this change will be allowed to happen is to allow a variable argument count, then OK I guess, but I would like to push for something more simple, less dynamic. We sort of have the benefit of hindsight here, but take the Ruby binding API for example. If we were designing Ruby today, there is no way we would choose to make binding as powerful and unrestricted as it is. It's much easier to remove restrictions later than to add them back in.
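As a rough illustration of the stricter, single-argument shape described above, here is a pure-Ruby sketch. `strict_byte_append` is a hypothetical helper written for this example, not an existing or proposed `String` method, and unlike a native implementation it still copies String arguments whose encoding differs from the receiver's.

```ruby
# Hypothetical helper: appends exactly one String or one byte-sized Integer,
# leaving buf.encoding untouched and performing no encoding validation.
def strict_byte_append(buf, arg)
  chunk =
    case arg
    when String
      arg.encoding == buf.encoding ? arg : arg.dup.force_encoding(buf.encoding)
    when Integer
      raise RangeError, "#{arg} out of byte range" unless (0..255).cover?(arg)
      # pack("C") always yields a single raw byte, whatever the receiver's encoding.
      [arg].pack("C").force_encoding(buf.encoding)
    else
      raise TypeError, "expected String or Integer, got #{arg.class}"
    end
  buf << chunk
end

buf = "somedata".b
strict_byte_append(buf, 255)
strict_byte_append(buf, "H\u20ACllo")

p buf.encoding # #<Encoding:BINARY (ASCII-8BIT)> on recent Rubies
p buf.bytesize # 16
```

With one argument and two accepted types there is a single type check per call, which is the kind of simplicity the paragraph above argues for.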
----------------------------------------
Feature #20594: A new String method to append bytes while preserving encoding
https://bugs.ruby-lang.org/issues/20594#change-108906
* Author: byroot (Jean Boussier)
* Status: Open
----------------------------------------
* [ruby-core:118543] [Ruby master Feature#20594] A new String method to append bytes while preserving encoding
From: matz (Yukihiro Matsumoto) via ruby-core @ 2024-07-11 5:06 UTC (permalink / raw)
To: ruby-core; +Cc: matz (Yukihiro Matsumoto)
Issue #20594 has been updated by matz (Yukihiro Matsumoto).
Assignee set to byroot (Jean Boussier)
I agree with having such a method. However, I disagree with the name `byteconcat`. Other methods with byte prefixes have the behavior of counting the index in bytes, but this method has nothing to do with the index; it has to do with the encoding.
If this method only works with `BINARY` encoding, the name might be `binary_concat` or `binconcat`. In the developers' meeting, someone proposed `force_concat`.
Matz.
----------------------------------------
Feature #20594: A new String method to append bytes while preserving encoding
https://bugs.ruby-lang.org/issues/20594#change-109061
* Author: byroot (Jean Boussier)
* Status: Open
* Assignee: byroot (Jean Boussier)
----------------------------------------
* [ruby-core:118545] [Ruby master Feature#20594] A new String method to append bytes while preserving encoding
From: byroot (Jean Boussier) via ruby-core @ 2024-07-11 5:18 UTC (permalink / raw)
To: ruby-core; +Cc: byroot (Jean Boussier)
Issue #20594 has been updated by byroot (Jean Boussier).
> Since other methods with byte prefixes have the behavior of counting the index in bytes, but this method has nothing to do with the index
I kinda disagree here. My understanding is that the `byte*` methods operate on bytes rather than characters, so `size != bytesize`.
Here the idea is to concat bytes, not characters so I think it fits.
> If this method only works with BINARY encoding,
No, it doesn't. It can work for any encoding; for instance, you can use it to assemble a UTF-8 string from some network reads:
```ruby
buf = +"" # UTF-8
while chunk = io.read(1024) # ASCII-8BIT
  buf.byteconcat(chunk)
  buf.valid_encoding? # May be true or false
end
buf
```
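A runnable variant of the loop above, as a sketch: `StringIO` stands in for the network socket, 2-byte reads are used so that a multi-byte character is split across chunks, and `byteconcat_shim` is a hypothetical stand-in for the proposed method (single String argument only).

```ruby
require "stringio"

# Hypothetical stand-in for the proposed method (single String argument only).
def byteconcat_shim(buf, chunk)
  buf << chunk.dup.force_encoding(buf.encoding)
end

io = StringIO.new("h\u20ACllo".b) # 7 bytes; the euro sign spans bytes 1..3
buf = String.new(encoding: Encoding::UTF_8)
validity = []

while (chunk = io.read(2)) # 2-byte reads split the euro sign across chunks
  byteconcat_shim(buf, chunk)
  validity << buf.valid_encoding?
end

p validity     # [false, true, true, true]
p buf.encoding # #<Encoding:UTF-8>
```

The buffer is briefly invalid after the first read and becomes valid again once the rest of the character arrives, which is the "may be true or false" case noted in the comment above.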
----------------------------------------
Feature #20594: A new String method to append bytes while preserving encoding
https://bugs.ruby-lang.org/issues/20594#change-109062
* Author: byroot (Jean Boussier)
* Status: Open
* Assignee: byroot (Jean Boussier)
----------------------------------------
* [ruby-core:118547] [Ruby master Feature#20594] A new String method to append bytes while preserving encoding
From: byroot (Jean Boussier) via ruby-core @ 2024-07-11 5:38 UTC (permalink / raw)
To: ruby-core; +Cc: byroot (Jean Boussier)
Issue #20594 has been updated by byroot (Jean Boussier).
Discussed in person in the meeting. I'll think about other names and propose some.
----------------------------------------
Feature #20594: A new String method to append bytes while preserving encoding
https://bugs.ruby-lang.org/issues/20594#change-109067
* Author: byroot (Jean Boussier)
* Status: Open
* Assignee: byroot (Jean Boussier)
----------------------------------------
### Context
When working with binary protocols such as `protobuf` or `MessagePack`, you may often need to assemble multiple
strings of different encoding:
```ruby
Post = Struct.new(:title, :body) do
def serialize(buf)
buf <<
255 << title.bytesize << title <<
255 << body.bytesize << body
end
end
Post.new("Hello", "World").serialize("somedata".b) # => "somedata\xFF\x05Hello\xFF\x05World" #<Encoding:ASCII-8BIT>
```
The problem in the above case, is that because `Encoding::ASCII_8BIT` is declared as ASCII compatible,
if one of the appended string contains bytes outside the ASCII range, string is automatically promoted
to another encoding, which then leads to encoding issues:
```ruby
Post.new("H€llo", "Wôrld").serialize("somedata".b) # => incompatible character encodings: ASCII-8BIT and UTF-8 (Encoding::CompatibilityError)
```
In many cases, you want to append to a String without changing the receiver's encoding.
The issue isn't exclusive to binary protocols and formats, it also happen with ASCII protocols that accept arbitrary bytes inline,
like Redis's RESP protocol or even HTTP/1.1.
### Previous discussion
There was a similar feature request a while ago, but it was abandoned: https://bugs.ruby-lang.org/issues/14975
### Existing solutions
You can of course always cast the strings you append to avoid this problem:
```ruby
Post = Struct.new(:title, :body) do
def serialize(buf)
buf <<
255 << title.bytesize << title.b <<
255 << body.bytesize << body.b
end
end
```
But this cause a lot of needless allocations.
You'd think you could also use `bytesplice`, but it actually has the same issue:
```ruby
Post = Struct.new(:title, :body) do
def serialize(buf)
buf << 255 << title.bytesize
buf.bytesplice(buf.bytesize, title.bytesize, title)
buf << 255 << body.bytesize
buf.bytesplice(buf.bytesize, body.bytesize, title)
end
end
Post.new("H€llo", "Wôrld").serialize("somedata".b) # => 'String#bytesplice': incompatible character encodings: BINARY (ASCII-8BIT) and UTF-8 (Encoding::CompatibilityError)
```
And even if it worked, it would be very unergonomic.
### Proposal: a `byteconcat` method
A solution to this would be to add a new `byteconcat` method, that could be shimed as:
```ruby
class String
  def byteconcat(*strings)
    strings.map! do |s|
      if s.is_a?(String) && s.encoding != encoding
        s.dup.force_encoding(encoding)
      else
        s
      end
    end
    concat(*strings)
  end
end

Post = Struct.new(:title, :body) do
  def serialize(buf)
    buf.byteconcat(
      255, title.bytesize, title,
      255, body.bytesize, body,
    )
  end
end

Post.new("H€llo", "Wôrld").serialize("somedata".b) # => "somedata\xFF\aH\xE2\x82\xACllo\xFF\x06W\xC3\xB4rld" #<Encoding:ASCII-8BIT>
```
But of course a builtin implementation wouldn't need to dup the arguments.
Like other `byte*` methods, it's the responsibility of the caller to ensure the resulting string has a valid encoding, or
to deal with it if not.
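The caveat about validity can be sketched with a stand-in helper. `append_raw` below is a hypothetical name, not an existing String method and not part of the proposal; it only mimics the shim above for a single String argument, and it is assumed here that the built-in method would behave the same way.

```ruby
# Hypothetical helper: appends the bytes of `str` to `buf` while keeping
# `buf`'s encoding, in the same way as the shim in the proposal.
def append_raw(buf, str)
  str = str.dup.force_encoding(buf.encoding) unless str.encoding == buf.encoding
  buf << str
end

buf = +"price: "                      # UTF-8 receiver
append_raw(buf, "\xE2\x82".b)         # only the first two bytes of "€"
truncated_valid = buf.valid_encoding? # false: the buffer ends mid-character

append_raw(buf, "\xAC".b)             # the missing third byte
completed_valid = buf.valid_encoding? # true: the bytes now spell "price: €"
```

The receiver stays UTF-8 throughout; whether its content is valid at any given point is entirely up to the caller.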
### Method name and signature
#### Name
This proposal suggests `String#byteconcat`, to mirror `String#concat`, but other names are possible:
- `byteappend` (like `Array#append`)
- `bytepush` (like `Array#push`)
#### Signature
This proposal makes `byteconcat` accept either `String` or `Integer` (in char range) arguments, like `concat`. I believe this makes sense for consistency, and also because it's not uncommon for protocols to have some byte-based segments, where Integers are more convenient.
The proposed method also accepts variable arguments, for consistency with `String#concat`, `Array#push`, and `Array#append`.
The proposed method returns self, like `concat` and others.
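For reference, this is the existing `String#concat` behavior that the signature mirrors, shown on a binary buffer. The variable names are illustrative only.

```ruby
buf = "".b

# String#concat accepts Integers and Strings, in any number.
result = buf.concat(255, 5, "Hello")

same_object = result.equal?(buf) # true: concat returns the receiver
bytes = buf.bytes                # [255, 5, 72, 101, 108, 108, 111]

# On a binary receiver, an Integer outside the char range is rejected.
range_error =
  begin
    buf.concat(256)
    nil
  rescue RangeError => e
    e
  end
```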
### YJIT consideration
I consulted @maximecb about this proposal, and according to her, accepting variable arguments makes it harder for YJIT to optimize.
I suspect consistency with other APIs trumps the performance consideration, but I think it's worth mentioning.
--
https://bugs.ruby-lang.org/
______________________________________________
ruby-core mailing list -- ruby-core@ml.ruby-lang.org
To unsubscribe send an email to ruby-core-leave@ml.ruby-lang.org
ruby-core info -- https://ml.ruby-lang.org/mailman3/lists/ruby-core.ml.ruby-lang.org/
* [ruby-core:118554] [Ruby master Feature#20594] A new String method to append bytes while preserving encoding
From: mame (Yusuke Endoh) via ruby-core @ 2024-07-11 8:43 UTC (permalink / raw)
To: ruby-core; +Cc: mame (Yusuke Endoh)
Issue #20594 has been updated by mame (Yusuke Endoh).
Existing methods with a byte prefix (String#byteindex, #bytesplice, etc.) mean that the unit of offset or size is bytes.
`byteconcat` and `byteappend`, on the other hand, are confusing because they have no offset or size, but they mean concatenation without regard to encodings.
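As an illustration of that reading of the prefix, here is how a few existing methods differ from their character-based counterparts on a UTF-8 string (a sketch; `String#byteindex` needs Ruby 3.2 or later):

```ruby
str = "héllo" # "é" takes two bytes in UTF-8

char_length = str.length   # 5 characters
byte_length = str.bytesize # 6 bytes

char_slice = str[1, 2]           # "él": offset and length counted in characters
byte_slice = str.byteslice(1, 2) # "é": offset and length counted in bytes

char_index = str.index("l")     # 2
byte_index = str.byteindex("l") # 3
```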
There were some alternative proposals that came out of the dev meeting.
* `force_concat` (matz: it lacks "encoding", I don't like it so much)
* `binary_concat` (It should work only when the receiver's encoding is BINARY. Does it fit with @byroot's motivation?)
----------------------------------------
Feature #20594: A new String method to append bytes while preserving encoding
https://bugs.ruby-lang.org/issues/20594#change-109075
* Author: byroot (Jean Boussier)
* Status: Open
* Assignee: byroot (Jean Boussier)
----------------------------------------
* [ruby-core:118555] [Ruby master Feature#20594] A new String method to append bytes while preserving encoding
From: byroot (Jean Boussier) via ruby-core @ 2024-07-11 8:45 UTC (permalink / raw)
To: ruby-core; +Cc: byroot (Jean Boussier)
Issue #20594 has been updated by byroot (Jean Boussier).
> binary_concat (It should work only when the receiver's encoding is BINARY. Does it fit with @byroot (Jean Boussier) 's motivation?)
No, it's too limiting. It should work with any encoding, in my opinion.
I'll try to come up with other names. So far I'm thinking of `String#append_bytes(String)`.
----------------------------------------
Feature #20594: A new String method to append bytes while preserving encoding
https://bugs.ruby-lang.org/issues/20594#change-109076
* Author: byroot (Jean Boussier)
* Status: Open
* Assignee: byroot (Jean Boussier)
----------------------------------------
* [ruby-core:118560] [Ruby master Feature#20594] A new String method to append bytes while preserving encoding
From: Eregon (Benoit Daloze) via ruby-core @ 2024-07-11 10:54 UTC (permalink / raw)
To: ruby-core; +Cc: Eregon (Benoit Daloze)
Issue #20594 has been updated by Eregon (Benoit Daloze).
mame (Yusuke Endoh) wrote in #note-7:
> Existing methods with a byte prefix (String#byteindex, #bytesplice, etc.) mean that the unit of offset or size is bytes.
My understanding of `byte*` methods is that they treat the String as a byte array, which implies not only that indices are byte indices but also that the encoding is ignored (this seems clear when one calls `"é".getbyte(0)`).
It's (almost) as if the string had the BINARY encoding for the duration of the operation, but without the overhead of switching to BINARY and back (which notably could cause some extra code range computation, etc.).
BTW, I would consider `each_byte` also a `byte*` method, and that one does not accept or pass byte indices.
So I think it would make sense to extend the meaning of `byte*` methods to be a little more general, just like I explained above.
I don't think it was documented to be only about byte indices either.
That said, I think `String#append_bytes(String)` sounds fine too.
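The byte-array view described above can be checked with a couple of existing methods. This is only a sketch with illustrative variable names:

```ruby
str = "é" # UTF-8, encoded as the two bytes 0xC3 0xA9

first_byte = str.getbyte(0)    # 195 (0xC3), whatever the string's encoding
all_bytes = str.each_byte.to_a # [195, 169]

# The same bytes tagged as BINARY give the same answers, which matches the
# "as if the string were BINARY" reading of the byte* methods.
binary_bytes = str.b.each_byte.to_a
```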
----------------------------------------
Feature #20594: A new String method to append bytes while preserving encoding
https://bugs.ruby-lang.org/issues/20594#change-109083
* Author: byroot (Jean Boussier)
* Status: Open
* Assignee: byroot (Jean Boussier)
----------------------------------------
* [ruby-core:118562] [Ruby master Feature#20594] A new String method to append bytes while preserving encoding
From: byroot (Jean Boussier) via ruby-core @ 2024-07-11 12:49 UTC (permalink / raw)
To: ruby-core; +Cc: byroot (Jean Boussier)
Issue #20594 has been updated by byroot (Jean Boussier).
> but also that the encoding is ignored
That was my understanding as well, but given that `bytesplice` does potentially raise `EncodingError`, there is a mismatch here.
In this case I very much want to allow breaking a string encoding.
----------------------------------------
Feature #20594: A new String method to append bytes while preserving encoding
https://bugs.ruby-lang.org/issues/20594#change-109084
* Author: byroot (Jean Boussier)
* Status: Open
* Assignee: byroot (Jean Boussier)
----------------------------------------
* [ruby-core:118564] [Ruby master Feature#20594] A new String method to append bytes while preserving encoding
From: Eregon (Benoit Daloze) via ruby-core @ 2024-07-11 15:19 UTC (permalink / raw)
To: ruby-core; +Cc: Eregon (Benoit Daloze)
Issue #20594 has been updated by Eregon (Benoit Daloze).
I wonder if that's a bug of `bytesplice`.
It's not as if e.g. `setbyte` would respect the encoding (it cannot possibly do so).
OTOH, `byteindex` also takes a String argument and raises `Encoding::CompatibilityError` for `"é".b.byteindex "é"`, so I guess it's not so clear-cut currently (I now understand the replies above a bit better).
Another way to see the new method is concatenation without encoding negotiation/checking encoding compatibility.
Having byte/bytes in the name seems a good way to express that.
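The contrast between the two existing methods can be sketched as follows (variable names are illustrative; `String#byteindex` needs Ruby 3.2 or later):

```ruby
# setbyte writes a raw byte and does not look at the encoding at all.
str = +"é"                        # UTF-8 bytes: 0xC3 0xA9
str.setbyte(1, 0x41)              # overwrite the continuation byte with "A"
still_utf8 = str.encoding         # Encoding::UTF_8, unchanged
valid_after = str.valid_encoding? # false: 0xC3 0x41 is not valid UTF-8

# byteindex, by contrast, checks the encoding compatibility of its argument.
lookup_error =
  begin
    "é".b.byteindex("é")
    nil
  rescue Encoding::CompatibilityError => e
    e
  end
```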
----------------------------------------
Feature #20594: A new String method to append bytes while preserving encoding
https://bugs.ruby-lang.org/issues/20594#change-109086
* Author: byroot (Jean Boussier)
* Status: Open
* Assignee: byroot (Jean Boussier)
----------------------------------------
* [ruby-core:118576] [Ruby master Feature#20594] A new String method to append bytes while preserving encoding
From: shugo (Shugo Maeda) via ruby-core @ 2024-07-12 0:53 UTC (permalink / raw)
To: ruby-core; +Cc: shugo (Shugo Maeda)
Issue #20594 has been updated by shugo (Shugo Maeda).
Eregon (Benoit Daloze) wrote in #note-11:
> I wonder if that's a bug of `bytesplice`.
> It's not like e.g. `setbyte` would respect the encoding (it cannot possibly).
> OTOH `byteindex` also takes a String argument and raises `Encoding::CompatibilityError` for `"é".b.byteindex "é"` so I guess it's not so clear-cut currently (I get the replies above a bit better).
It's intended behavior.
Having a long history of suffering from character encoding issues such as mojibake, we Japanese believe that operations on strings with different encodings should be approached with caution.
`byte-` methods could be considered exceptions, but at present, they are not. That's why Matz doesn't like the name `byteconcat`.
----------------------------------------
Feature #20594: A new String method to append bytes while preserving encoding
https://bugs.ruby-lang.org/issues/20594#change-109101
* Author: byroot (Jean Boussier)
* Status: Open
* Assignee: byroot (Jean Boussier)
----------------------------------------
* [ruby-core:118632] [Ruby master Feature#20594] A new String method to append bytes while preserving encoding
2024-06-25 6:39 [ruby-core:118388] [Ruby master Feature#20594] A new String method to append bytes while preserving encoding byroot (Jean Boussier) via ruby-core
` (10 preceding siblings ...)
2024-07-12 0:53 ` [ruby-core:118576] " shugo (Shugo Maeda) via ruby-core
@ 2024-07-18 19:15 ` Dan0042 (Daniel DeLorme) via ruby-core
2024-07-18 20:48 ` [ruby-core:118633] " alanwu (Alan Wu) via ruby-core
` (11 subsequent siblings)
23 siblings, 0 replies; 25+ messages in thread
From: Dan0042 (Daniel DeLorme) via ruby-core @ 2024-07-18 19:15 UTC (permalink / raw)
To: ruby-core; +Cc: Dan0042 (Daniel DeLorme)
Issue #20594 has been updated by Dan0042 (Daniel DeLorme).
From the [dev meeting log](https://github.com/ruby/dev-meeting-log/blob/master/2024/DevMeeting-2024-07-11.md#feature-20594-a-new-string-method-to-append-bytes-while-preserving-encoding-byroot) these two points stuck out at me:
* nobu: can we use IO::Buffer?
* shyouhei: why not StringIO?
These are good questions that I also wondered about, but the answers are not recorded in the meeting log. What is the reason not to use these APIs which are pretty much designed for this exact use case? Why introduce a third API?
BTW I have to say I find it extremely ironic that we went to so much trouble to migrate from byte-oriented strings in 1.8 to character-oriented strings in 1.9, and now we're re-adding byte-oriented methods.
----------------------------------------
Feature #20594: A new String method to append bytes while preserving encoding
https://bugs.ruby-lang.org/issues/20594#change-109159
* Author: byroot (Jean Boussier)
* Status: Open
* Assignee: byroot (Jean Boussier)
----------------------------------------
* [ruby-core:118633] [Ruby master Feature#20594] A new String method to append bytes while preserving encoding
2024-06-25 6:39 [ruby-core:118388] [Ruby master Feature#20594] A new String method to append bytes while preserving encoding byroot (Jean Boussier) via ruby-core
` (11 preceding siblings ...)
2024-07-18 19:15 ` [ruby-core:118632] " Dan0042 (Daniel DeLorme) via ruby-core
@ 2024-07-18 20:48 ` alanwu (Alan Wu) via ruby-core
2024-07-19 8:07 ` [ruby-core:118636] " byroot (Jean Boussier) via ruby-core
` (10 subsequent siblings)
23 siblings, 0 replies; 25+ messages in thread
From: alanwu (Alan Wu) via ruby-core @ 2024-07-18 20:48 UTC (permalink / raw)
To: ruby-core; +Cc: alanwu (Alan Wu)
Issue #20594 has been updated by alanwu (Alan Wu).
> What is the reason not to use these APIs which are pretty much designed for this exact use case? Why introduce a third API?
Sometimes only `String` gives you the most convenience and efficiency because of API interoperability. For example, binary protocols often have you reaching for Array#pack and String#unpack, and these don't directly work with StringIO and IO::Buffer.
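As a rough illustration of the interoperability point, here is a sketch with a made-up framing (1-byte tag, 2-byte big-endian length, then the payload). The method names `encode_frame` and `decode_frame` are hypothetical; only `Array#pack`, `String#unpack`, `String#byteslice` and `StringIO` are real APIs.

```ruby
require "stringio"

# Hypothetical framing: 1-byte tag, 2-byte big-endian length, then payload.
def encode_frame(buf, tag, payload)
  bytes = payload.b                              # BINARY copy, avoids the encoding clash
  buf << [tag, bytes.bytesize].pack("Cn") << bytes
end

def decode_frame(buf)
  tag, len = buf.unpack("Cn")                    # String#unpack reads the buffer directly
  [tag, buf.byteslice(3, len)]
end

buf = encode_frame(String.new(encoding: Encoding::BINARY), 7, "W\u00F4rld")
p decode_frame(buf)

# With StringIO, Array#pack still produces a String that has to be written in,
# and unpacking means going back to the underlying String.
io = StringIO.new(String.new(encoding: Encoding::BINARY))
io.write([7, 6].pack("Cn"), "W\u00F4rld".b)
p io.string.unpack("Cn")
```

In both halves the packed header is a plain `String`, so a `String` buffer avoids one layer of indirection.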
----------------------------------------
Feature #20594: A new String method to append bytes while preserving encoding
https://bugs.ruby-lang.org/issues/20594#change-109160
* Author: byroot (Jean Boussier)
* Status: Open
* Assignee: byroot (Jean Boussier)
----------------------------------------
* [ruby-core:118636] [Ruby master Feature#20594] A new String method to append bytes while preserving encoding
2024-06-25 6:39 [ruby-core:118388] [Ruby master Feature#20594] A new String method to append bytes while preserving encoding byroot (Jean Boussier) via ruby-core
` (12 preceding siblings ...)
2024-07-18 20:48 ` [ruby-core:118633] " alanwu (Alan Wu) via ruby-core
@ 2024-07-19 8:07 ` byroot (Jean Boussier) via ruby-core
2024-07-26 6:20 ` [ruby-core:118690] " byroot (Jean Boussier) via ruby-core
` (9 subsequent siblings)
23 siblings, 0 replies; 25+ messages in thread
From: byroot (Jean Boussier) via ruby-core @ 2024-07-19 8:07 UTC (permalink / raw)
To: ruby-core; +Cc: byroot (Jean Boussier)
Issue #20594 has been updated by byroot (Jean Boussier).
@Dan0042 I already answered your same question in [Feature #20394]. Whether we like it or not, neither StringIO nor IO::Buffer is fast, compatible, and convenient enough to replace `String` for these use cases.
If you don't believe me, I encourage you to try using either in https://github.com/Shopify/protoboeuf or https://github.com/redis-rb/redis-client (or any other gem of your choice) and see how that goes.
----------------------------------------
Feature #20594: A new String method to append bytes while preserving encoding
https://bugs.ruby-lang.org/issues/20594#change-109163
* Author: byroot (Jean Boussier)
* Status: Open
* Assignee: byroot (Jean Boussier)
----------------------------------------
* [ruby-core:118690] [Ruby master Feature#20594] A new String method to append bytes while preserving encoding
2024-06-25 6:39 [ruby-core:118388] [Ruby master Feature#20594] A new String method to append bytes while preserving encoding byroot (Jean Boussier) via ruby-core
` (13 preceding siblings ...)
2024-07-19 8:07 ` [ruby-core:118636] " byroot (Jean Boussier) via ruby-core
@ 2024-07-26 6:20 ` byroot (Jean Boussier) via ruby-core
2024-07-30 8:03 ` [ruby-core:118735] " duerst via ruby-core
` (8 subsequent siblings)
23 siblings, 0 replies; 25+ messages in thread
From: byroot (Jean Boussier) via ruby-core @ 2024-07-26 6:20 UTC (permalink / raw)
To: ruby-core; +Cc: byroot (Jean Boussier)
Issue #20594 has been updated by byroot (Jean Boussier).
Ok, so after thinking about this for a bit, I think a good name would be:
- `String#append_bytes(String) => self`
- `String#append_byte(Integer) => self`
These names still mention bytes, but they don't use the same `byte*` prefix as other methods, so I think they're different enough. If anything mentioning bytes is deemed too confusing, I can search for something else, but I really think that's the proper name for the concept.
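To make the two signatures concrete, here is a pure-Ruby approximation. The method names come from the list above; everything else is an assumption about the intended semantics (in particular, raising `RangeError` for integers outside 0..255 is a choice made for this sketch, and a builtin would not need the `str.b` copy).

```ruby
# Sketch only: approximates the two proposed methods in plain Ruby.
class String
  # Append the raw bytes of +str+; the receiver keeps its encoding.
  def append_bytes(str)
    enc = encoding
    force_encoding(Encoding::BINARY)
    begin
      concat(str.b)            # both sides are BINARY here, so no compatibility check fails
    ensure
      force_encoding(enc)      # restore the receiver's original encoding
    end
    self
  end

  # Append a single byte given as an Integer in 0..255 (assumed range check).
  def append_byte(byte)
    raise RangeError, "#{byte} is not in 0..255" unless (0..255).cover?(byte)
    append_bytes(byte.chr)     # Integer#chr yields a one-byte String for 0..255
  end
end

buf   = String.new("somedata", encoding: Encoding::BINARY)
title = "H\u20ACllo"
buf.append_byte(255).append_byte(title.bytesize).append_bytes(title)
p buf.encoding
p buf.bytesize
```

As with the `byteconcat` shim in the issue description, the caller remains responsible for whether the resulting string is valid in the receiver's encoding.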
----------------------------------------
Feature #20594: A new String method to append bytes while preserving encoding
https://bugs.ruby-lang.org/issues/20594#change-109226
* Author: byroot (Jean Boussier)
* Status: Open
* Assignee: byroot (Jean Boussier)
----------------------------------------
* [ruby-core:118735] [Ruby master Feature#20594] A new String method to append bytes while preserving encoding
2024-06-25 6:39 [ruby-core:118388] [Ruby master Feature#20594] A new String method to append bytes while preserving encoding byroot (Jean Boussier) via ruby-core
` (14 preceding siblings ...)
2024-07-26 6:20 ` [ruby-core:118690] " byroot (Jean Boussier) via ruby-core
@ 2024-07-30 8:03 ` duerst via ruby-core
2024-07-30 12:36 ` [ruby-core:118738] " Dan0042 (Daniel DeLorme) via ruby-core
` (7 subsequent siblings)
23 siblings, 0 replies; 25+ messages in thread
From: duerst via ruby-core @ 2024-07-30 8:03 UTC (permalink / raw)
To: ruby-core; +Cc: duerst
Issue #20594 has been updated by duerst (Martin Dürst).
Eregon (Benoit Daloze) wrote in #note-9:
> My understanding of `byte*` methods is they treat the String as a byte array, which implies indices are just byte indices but also that the encoding is ignored (it seems clear when one does `"é".getbyte(0)`).
This may need a completely separate issue, but when I introduced `String#force_encoding`, I was imagining adding a block to it so that the forced encoding would only apply inside the block.
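```ruby
# Proposed block form: the forced encoding would apply only inside the block
# (this is not how force_encoding behaves today).
s = "your string (e.g. UTF-8 or whatever) here"
s.force_encoding(Encoding::BINARY) {
  # binary operations go here
}
s.encoding # => #<Encoding:UTF-8>
```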
This would streamline things quite a bit; no need for separate `byte*` methods. Using a block should also help because in many use cases, there would be more than just one `byte*` method call.
The question then would of course be how to optimize things. But to a large part, the optimizations are actually already done, because different encodings use different primitives.
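The block-scoped idea can be emulated today with a small helper. This is a sketch, not an existing Ruby API: `with_forced_encoding` is a hypothetical name, and it relies only on `String#force_encoding` plus `ensure` to restore the original encoding.

```ruby
# Hypothetical helper: force +enc+ on +str+ for the duration of the block,
# then restore the original encoding even if the block raises.
def with_forced_encoding(str, enc)
  original = str.encoding
  str.force_encoding(enc)
  yield str
ensure
  str.force_encoding(original) if original
end

s = String.new("caf\u00E9", encoding: Encoding::UTF_8)
with_forced_encoding(s, Encoding::BINARY) do |bytes|
  bytes << "\xFF".b        # binary operations go here
end
p s.encoding               # restored once the block has finished
p s.valid_encoding?        # validity is still the caller's problem
```

Unlike a builtin, this version mutates the encoding tag twice per call, so it says nothing about how the optimization question would be answered.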
----------------------------------------
Feature #20594: A new String method to append bytes while preserving encoding
https://bugs.ruby-lang.org/issues/20594#change-109274
* Author: byroot (Jean Boussier)
* Status: Open
* Assignee: byroot (Jean Boussier)
----------------------------------------
### Context
When working with binary protocols such as `protobuf` or `MessagePack`, you may often need to assemble multiple
strings of different encoding:
```ruby
Post = Struct.new(:title, :body) do
def serialize(buf)
buf <<
255 << title.bytesize << title <<
255 << body.bytesize << body
end
end
Post.new("Hello", "World").serialize("somedata".b) # => "somedata\xFF\x05Hello\xFF\x05World" #<Encoding:ASCII-8BIT>
```
The problem in the above case, is that because `Encoding::ASCII_8BIT` is declared as ASCII compatible,
if one of the appended string contains bytes outside the ASCII range, string is automatically promoted
to another encoding, which then leads to encoding issues:
```ruby
Post.new("H€llo", "Wôrld").serialize("somedata".b) # => incompatible character encodings: ASCII-8BIT and UTF-8 (Encoding::CompatibilityError)
```
In many cases, you want to append to a String without changing the receiver's encoding.
The issue isn't exclusive to binary protocols and formats, it also happen with ASCII protocols that accept arbitrary bytes inline,
like Redis's RESP protocol or even HTTP/1.1.
### Previous discussion
There was a similar feature request a while ago, but it was abandoned: https://bugs.ruby-lang.org/issues/14975
### Existing solutions
You can of course always cast the strings you append to avoid this problem:
```ruby
Post = Struct.new(:title, :body) do
def serialize(buf)
buf <<
255 << title.bytesize << title.b <<
255 << body.bytesize << body.b
end
end
```
But this cause a lot of needless allocations.
You'd think you could also use `bytesplice`, but it actually has the same issue:
```ruby
Post = Struct.new(:title, :body) do
def serialize(buf)
buf << 255 << title.bytesize
buf.bytesplice(buf.bytesize, title.bytesize, title)
buf << 255 << body.bytesize
buf.bytesplice(buf.bytesize, body.bytesize, title)
end
end
Post.new("H€llo", "Wôrld").serialize("somedata".b) # => 'String#bytesplice': incompatible character encodings: BINARY (ASCII-8BIT) and UTF-8 (Encoding::CompatibilityError)
```
And even if it worked, it would be very unergonomic.
### Proposal: a `byteconcat` method
A solution to this would be to add a new `byteconcat` method, that could be shimed as:
```ruby
class String
def byteconcat(*strings)
strings.map! do |s|
if s.is_a?(String) && s.encoding != encoding
s.dup.force_encoding(encoding)
else
s
end
end
concat(*strings)
end
end
Post = Struct.new(:title, :body) do
def serialize(buf)
buf.byteconcat(
255, title.bytesize, title,
255, body.bytesize, body,
)
end
end
Post.new("H€llo", "Wôrld").serialize("somedata".b) # => "somedata\xFF\aH\xE2\x82\xACllo\xFF\x06W\xC3\xB4rld" #<Encoding:ASCII-8BIT>
```
But of course a builtin implementation wouldn't need to dup the arguments.
Like other `byte*` methods, it's the responsibility of the caller to ensure the resulting string has a valid encoding, or
to deal with it if not.
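As an illustration of that responsibility, the sketch below uses only existing methods (`force_encoding` plus `concat`, the same trick as the shim) to append a multi-byte character in two pieces to a UTF-8 receiver. The variable names are illustrative. The receiver keeps its encoding throughout, but it is only valid once the last byte has arrived.

```ruby
buf  = String.new("abc", encoding: Encoding::UTF_8)
euro = "\u20AC".b                       # EURO SIGN as raw bytes: E2 82 AC

# Relabel each fragment with the receiver's encoding so concat never
# sees two different encodings.
head = euro.byteslice(0, 2).force_encoding(buf.encoding)
tail = euro.byteslice(2, 1).force_encoding(buf.encoding)

buf.concat(head)
p buf.valid_encoding?                   # false: truncated character at the end

buf.concat(tail)
p buf.valid_encoding?                   # true: the character is now complete
p buf.encoding == Encoding::UTF_8       # true: the encoding never changed
```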
### Method name and signature
#### Name
This proposal suggests `String#byteconcat`, to mirror `String#concat`, but other names are possible:
- `byteappend` (like `Array#append`)
- `bytepush` (like `Array#push`)
#### Signature
This proposal makes `byteconcat` accept either `String` or `Integer` (in char range) arguments like `concat`. I believe it makes sense for consistency and also because it's not uncommon for protocols to have some byte-based segments, and Integers are more convenient there.
The proposed method also accepts variable arguments for consistency with `String#concat`, `Array#push`, `Array#append`.
The proposed method returns self, like `concat` and others.
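For reference, this is the existing `String#concat` behavior the signature mirrors, shown on a binary receiver. It is a small sketch with illustrative variable names; it only demonstrates today's `concat`, not the proposed method.

```ruby
buf = String.new(encoding: Encoding::BINARY)

# concat accepts several arguments, mixing Integers and Strings,
# and returns the receiver itself.
result = buf.concat(255, "ab", 0)

p result.equal?(buf)   # true
p buf.bytes            # [255, 97, 98, 0]
```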
### YJIT consideration
I consulted @maximecb about this proposal, and according to her, accepting variable arguments makes it harder for YJIT to optimize.
I suspect consistency with other APIs trumps the performance consideration, but I think it's worth mentioning.
--
https://bugs.ruby-lang.org/
______________________________________________
ruby-core mailing list -- ruby-core@ml.ruby-lang.org
To unsubscribe send an email to ruby-core-leave@ml.ruby-lang.org
ruby-core info -- https://ml.ruby-lang.org/mailman3/lists/ruby-core.ml.ruby-lang.org/
^ permalink raw reply [flat|nested] 25+ messages in thread
* [ruby-core:118738] [Ruby master Feature#20594] A new String method to append bytes while preserving encoding
2024-06-25 6:39 [ruby-core:118388] [Ruby master Feature#20594] A new String method to append bytes while preserving encoding byroot (Jean Boussier) via ruby-core
` (15 preceding siblings ...)
2024-07-30 8:03 ` [ruby-core:118735] " duerst via ruby-core
@ 2024-07-30 12:36 ` Dan0042 (Daniel DeLorme) via ruby-core
2024-08-01 9:30 ` [ruby-core:118763] " matz (Yukihiro Matsumoto) via ruby-core
` (6 subsequent siblings)
23 siblings, 0 replies; 25+ messages in thread
From: Dan0042 (Daniel DeLorme) via ruby-core @ 2024-07-30 12:36 UTC (permalink / raw)
To: ruby-core; +Cc: Dan0042 (Daniel DeLorme)
Issue #20594 has been updated by Dan0042 (Daniel DeLorme).
duerst (Martin Dürst) wrote in #note-17:
> This may need a completely separate issue, but when I introduced `String#force_encoding`, I was imagining adding a block to it so that the forced encoding would only apply inside the block.
+1, that would be very convenient, but yes it's a different issue because it doesn't ignore Encoding::CompatibilityError which is the point of this proposal.
Many languages have a ByteArray or ByteBuffer separate from String, and I believe this is what Ruby needs. Design-wise, in the long run, I believe it's better to evolve StringIO or IO::Buffer into a proper byte buffer to handle encoding-agnostic bytes; and String should only handle encoding-aware characters. (I apologize for beating a dead horse.)
----------------------------------------
Feature #20594: A new String method to append bytes while preserving encoding
https://bugs.ruby-lang.org/issues/20594#change-109276
* Author: byroot (Jean Boussier)
* Status: Open
* Assignee: byroot (Jean Boussier)
----------------------------------------
^ permalink raw reply [flat|nested] 25+ messages in thread
* [ruby-core:118763] [Ruby master Feature#20594] A new String method to append bytes while preserving encoding
2024-06-25 6:39 [ruby-core:118388] [Ruby master Feature#20594] A new String method to append bytes while preserving encoding byroot (Jean Boussier) via ruby-core
` (16 preceding siblings ...)
2024-07-30 12:36 ` [ruby-core:118738] " Dan0042 (Daniel DeLorme) via ruby-core
@ 2024-08-01 9:30 ` matz (Yukihiro Matsumoto) via ruby-core
2024-08-01 9:35 ` [ruby-core:118766] " byroot (Jean Boussier) via ruby-core
` (5 subsequent siblings)
23 siblings, 0 replies; 25+ messages in thread
From: matz (Yukihiro Matsumoto) via ruby-core @ 2024-08-01 9:30 UTC (permalink / raw)
To: ruby-core; +Cc: matz (Yukihiro Matsumoto)
Issue #20594 has been updated by matz (Yukihiro Matsumoto).
`append_bytes` seems OK to me. Could you design the concrete behavior of the method:
* does it take more than one argument?
* does it take integers too?
* what is the result of the method encoding-wise?
* etc.
Matz.
----------------------------------------
Feature #20594: A new String method to append bytes while preserving encoding
https://bugs.ruby-lang.org/issues/20594#change-109315
* Author: byroot (Jean Boussier)
* Status: Open
* Assignee: byroot (Jean Boussier)
----------------------------------------
^ permalink raw reply [flat|nested] 25+ messages in thread
* [ruby-core:118766] [Ruby master Feature#20594] A new String method to append bytes while preserving encoding
2024-06-25 6:39 [ruby-core:118388] [Ruby master Feature#20594] A new String method to append bytes while preserving encoding byroot (Jean Boussier) via ruby-core
` (17 preceding siblings ...)
2024-08-01 9:30 ` [ruby-core:118763] " matz (Yukihiro Matsumoto) via ruby-core
@ 2024-08-01 9:35 ` byroot (Jean Boussier) via ruby-core
2024-08-02 17:09 ` [ruby-core:118780] " alanwu (Alan Wu) via ruby-core
` (4 subsequent siblings)
23 siblings, 0 replies; 25+ messages in thread
From: byroot (Jean Boussier) via ruby-core @ 2024-08-01 9:35 UTC (permalink / raw)
To: ruby-core; +Cc: byroot (Jean Boussier)
Issue #20594 has been updated by byroot (Jean Boussier).
Thank you Matz.
I opened a PR that implements the method as envisioned: https://github.com/ruby/ruby/pull/11293
> does it take more than one argument?
No, a single argument, and only a String (T_STRING) as suggested by YJIT people to make it optimizable.
> does it take integers too?
Similarly, YJIT people suggested to not take different types. Hence why I proposed two methods `append_byte(Integer)` and `append_bytes(String)`.
> what is the result of the method encoding-wise?
The receiver's encoding is never changed, even if it means that its encoding becomes invalid. It's the caller's responsibility to check `String#valid_encoding?` if that's a possibility and to deal with it.
It also means `append_bytes` never raises an `Encoding::CompatibilityError`.
----------------------------------------
Feature #20594: A new String method to append bytes while preserving encoding
https://bugs.ruby-lang.org/issues/20594#change-109318
* Author: byroot (Jean Boussier)
* Status: Open
* Assignee: byroot (Jean Boussier)
----------------------------------------
^ permalink raw reply [flat|nested] 25+ messages in thread
* [ruby-core:118780] [Ruby master Feature#20594] A new String method to append bytes while preserving encoding
2024-06-25 6:39 [ruby-core:118388] [Ruby master Feature#20594] A new String method to append bytes while preserving encoding byroot (Jean Boussier) via ruby-core
` (18 preceding siblings ...)
2024-08-01 9:35 ` [ruby-core:118766] " byroot (Jean Boussier) via ruby-core
@ 2024-08-02 17:09 ` alanwu (Alan Wu) via ruby-core
2024-08-06 12:15 ` [ruby-core:118799] " byroot (Jean Boussier) via ruby-core
` (3 subsequent siblings)
23 siblings, 0 replies; 25+ messages in thread
From: alanwu (Alan Wu) via ruby-core @ 2024-08-02 17:09 UTC (permalink / raw)
To: ruby-core; +Cc: alanwu (Alan Wu)
Issue #20594 has been updated by alanwu (Alan Wu).
Let me present an alternative design that only adds one method. The name is
String#append_as_bytes, and the name provides a framing of "reinterpretation"
that helps to explain the behavior of the method.
```
call-seq:
append_as_bytes(*objects) -> self
Interpret arguments as bytes and append them to +self+ without changing the
encoding of +self+.
For each object that is a String, append the bytes of the string to +self+. For
each Integer object +i+, append a byte that is the bitwise AND of +i+ and
+0xff+. If any other type of objects is in +objects+, leave +self+ unmodified
and raise ArgumentError. This method does not attempt to implicitly convert any
arguments.
Examples:
sevenzip_signature = ''.b
sevenzip_signature.append_as_bytes('7z', 0xbc, 0xaf, 0x27, 0x1c) #=> "7z\xBC\xAF'\x1C"
```
It's clear from the name that the method has its own interpretation of
arguments. That gives a hint that it does something unusual, as it breaks
away from the default, "bytes with an encoding" stance of strings. It also
grammatically stands out from existing `*byte*` methods, a reflection of the
differences in behavior.
For string arguments, it's the same as `append_bytes(String)` from byroot. For
integers, the `i & 0xff` masking behavior comes from String#setbyte. Note that
it masks without making calls.
```ruby
-128 & 0xff # => 128
"x".tap{ _1.setbyte(0, -128)}.bytes # => [128]
```
This masking is how it interprets an integer as a byte.
The method does not accept arrays for simplicity, as splatting is already
available as a flexible option for callers.
I think this design strikes a good balance between usability, efficiency, and
how well compilers can handle it.
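The documented behavior can be approximated in plain Ruby. The following is a rough sketch under stated assumptions, not the proposed C implementation: the module and method names (`AppendAsBytesSketch`, `append_as_bytes_sketch`) are hypothetical, it takes the receiver as an explicit first argument, and unlike a builtin it allocates a temporary copy for each String argument.

```ruby
# Hypothetical pure-Ruby approximation of the proposed semantics.
module AppendAsBytesSketch
  module_function

  def append_as_bytes_sketch(buf, *objects)
    # Check every argument first so buf stays unmodified on error.
    objects.each do |obj|
      unless obj.is_a?(String) || obj.is_a?(Integer)
        raise ArgumentError, "expected String or Integer, got #{obj.class}"
      end
    end

    original_encoding = buf.encoding
    # Temporarily treat the receiver as raw bytes so no append can raise
    # Encoding::CompatibilityError.
    buf.force_encoding(Encoding::BINARY)
    begin
      objects.each do |obj|
        if obj.is_a?(Integer)
          buf << (obj & 0xff)   # same masking as String#setbyte
        else
          buf << obj.b          # raw bytes of the argument
        end
      end
    ensure
      buf.force_encoding(original_encoding)
    end
    buf
  end
end

sig = String.new(encoding: Encoding::UTF_8)
AppendAsBytesSketch.append_as_bytes_sketch(sig, "7z", 0xbc, 0xaf, 0x27, 0x1c, -128)
p sig.bytes                           # [55, 122, 188, 175, 39, 28, 128]
p sig.encoding == Encoding::UTF_8     # true
```

Validating all arguments up front is one way to get the "leave `self` unmodified and raise ArgumentError" behavior; a native implementation could make the same guarantee without the temporary copies.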
----------------------------------------
Feature #20594: A new String method to append bytes while preserving encoding
https://bugs.ruby-lang.org/issues/20594#change-109332
* Author: byroot (Jean Boussier)
* Status: Open
* Assignee: byroot (Jean Boussier)
----------------------------------------
^ permalink raw reply [flat|nested] 25+ messages in thread
* [ruby-core:118799] [Ruby master Feature#20594] A new String method to append bytes while preserving encoding
2024-06-25 6:39 [ruby-core:118388] [Ruby master Feature#20594] A new String method to append bytes while preserving encoding byroot (Jean Boussier) via ruby-core
` (19 preceding siblings ...)
2024-08-02 17:09 ` [ruby-core:118780] " alanwu (Alan Wu) via ruby-core
@ 2024-08-06 12:15 ` byroot (Jean Boussier) via ruby-core
2024-08-07 16:22 ` [ruby-core:118804] " tenderlovemaking (Aaron Patterson) via ruby-core
` (2 subsequent siblings)
23 siblings, 0 replies; 25+ messages in thread
From: byroot (Jean Boussier) via ruby-core @ 2024-08-06 12:15 UTC (permalink / raw)
To: ruby-core; +Cc: byroot (Jean Boussier)
Issue #20594 has been updated by byroot (Jean Boussier).
> The name is `String#append_as_bytes`, and the name provides a framing of "reinterpretation" that helps to explain the behavior of the method.
I like that name, I think it clears up any possible confusion.
> `append_as_bytes(*objects)`
I retired the `*objects` from my proposal because I've been told it'd be too hard to optimize, but if you believe it's no problem, I definitely prefer it as it's much more Ruby-like, and was my initial proposal.
> For integers, the i & 0xff masking behavior comes from String#setbyte.
I agree it makes sense to mirror `setbyte` rather than `String#<<(Integer)` here.
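The difference between the two existing behaviors can be seen with current methods only. This is a small illustrative sketch with made-up variable names: `String#<<` with an Integer depends on the receiver's encoding and can raise, while `String#setbyte` always masks to a single byte.

```ruby
binary = String.new(encoding: Encoding::BINARY)
binary << 255
p binary.bytes                # [255]

begin
  binary << 256               # does not fit in one byte
rescue RangeError => e
  p e.class                   # RangeError
end

utf8 = String.new(encoding: Encoding::UTF_8)
utf8 << 255                   # treated as code point U+00FF, not as a byte
p utf8.bytes                  # [195, 191]

s = "x".b
s.setbyte(0, 0x141)           # 0x141 & 0xff == 0x41
p s.bytes                     # [65]
```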
----------------------------------------
Feature #20594: A new String method to append bytes while preserving encoding
https://bugs.ruby-lang.org/issues/20594#change-109355
* Author: byroot (Jean Boussier)
* Status: Open
* Assignee: byroot (Jean Boussier)
----------------------------------------
^ permalink raw reply [flat|nested] 25+ messages in thread
* [ruby-core:118804] [Ruby master Feature#20594] A new String method to append bytes while preserving encoding
2024-06-25 6:39 [ruby-core:118388] [Ruby master Feature#20594] A new String method to append bytes while preserving encoding byroot (Jean Boussier) via ruby-core
` (20 preceding siblings ...)
2024-08-06 12:15 ` [ruby-core:118799] " byroot (Jean Boussier) via ruby-core
@ 2024-08-07 16:22 ` tenderlovemaking (Aaron Patterson) via ruby-core
2024-08-10 1:30 ` [ruby-core:118825] " mame (Yusuke Endoh) via ruby-core
2024-09-05 4:54 ` [ruby-core:119053] " matz (Yukihiro Matsumoto) via ruby-core
23 siblings, 0 replies; 25+ messages in thread
From: tenderlovemaking (Aaron Patterson) via ruby-core @ 2024-08-07 16:22 UTC (permalink / raw)
To: ruby-core; +Cc: tenderlovemaking (Aaron Patterson)
Issue #20594 has been updated by tenderlovemaking (Aaron Patterson).
No opinion on the method name / API, but we've verified the effectiveness of this optimization using our protobuf implementation, and you can see the results [here](https://github.com/Shopify/protoboeuf/pull/116).
----------------------------------------
Feature #20594: A new String method to append bytes while preserving encoding
https://bugs.ruby-lang.org/issues/20594#change-109362
* Author: byroot (Jean Boussier)
* Status: Open
* Assignee: byroot (Jean Boussier)
----------------------------------------
### Context
When working with binary protocols such as `protobuf` or `MessagePack`, you may often need to assemble multiple
strings of different encodings:
```ruby
Post = Struct.new(:title, :body) do
  def serialize(buf)
    buf <<
      255 << title.bytesize << title <<
      255 << body.bytesize << body
  end
end
Post.new("Hello", "World").serialize("somedata".b) # => "somedata\xFF\x05Hello\xFF\x05World" #<Encoding:ASCII-8BIT>
```
The problem in the above case is that, because `Encoding::ASCII_8BIT` is declared as ASCII compatible,
if one of the appended strings contains bytes outside the ASCII range, the receiver is automatically promoted
to another encoding, which then leads to encoding issues:
```ruby
Post.new("H€llo", "Wôrld").serialize("somedata".b) # => incompatible character encodings: ASCII-8BIT and UTF-8 (Encoding::CompatibilityError)
```
In many cases, you want to append to a String without changing the receiver's encoding.
The issue isn't exclusive to binary protocols and formats; it also happens with ASCII protocols that accept arbitrary bytes inline,
like Redis's RESP protocol or even HTTP/1.1.
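Both outcomes of `<<` on a binary buffer can be reproduced in a few lines. This is only a minimal sketch of the behaviour described above; the variable names are made up for illustration, and `"\u20AC"` is used so the literal is UTF-8 regardless of the source encoding.

```ruby
# An ASCII-only binary buffer is silently re-tagged as UTF-8 when a
# non-ASCII UTF-8 string is appended to it.
ascii_only = "somedata".b
ascii_only << "H\u20ACllo"                 # "H€llo"
promoted_encoding = ascii_only.encoding    # Encoding::UTF_8

# Once the buffer already holds a non-ASCII byte, the same append raises.
with_high_byte = "somedata".b << 255
error = begin
  with_high_byte << "H\u20ACllo"
  nil
rescue Encoding::CompatibilityError => e
  e
end
```

So depending on what the buffer already contains, the caller either gets an exception or a buffer that quietly stopped being binary.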
### Previous discussion
There was a similar feature request a while ago, but it was abandoned: https://bugs.ruby-lang.org/issues/14975
### Existing solutions
You can of course always cast the strings you append to avoid this problem:
```ruby
Post = Struct.new(:title, :body) do
  def serialize(buf)
    buf <<
      255 << title.bytesize << title.b <<
      255 << body.bytesize << body.b
  end
end
```
But this causes a lot of needless allocations.
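To make the allocation cost concrete, here is a rough sketch that counts objects allocated by repeated `.b` casts. It assumes CRuby, where `GC.stat(:total_allocated_objects)` is available; the exact count is implementation-dependent, so only the lower bound is meaningful.

```ruby
title = "H\u20ACllo"

# String#b always returns a new String object, never the receiver itself.
copy = title.b
same_object = copy.equal?(title)   # false

# Count objects allocated while casting in a loop (CRuby-specific counter).
before = GC.stat(:total_allocated_objects)
1_000.times { title.b }
allocated = GC.stat(:total_allocated_objects) - before
```

Each cast costs at least one String object, so `allocated` should be no smaller than the number of iterations.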
You'd think you could also use `bytesplice`, but it actually has the same issue:
```ruby
Post = Struct.new(:title, :body) do
  def serialize(buf)
    buf << 255 << title.bytesize
    buf.bytesplice(buf.bytesize, title.bytesize, title)
    buf << 255 << body.bytesize
    buf.bytesplice(buf.bytesize, body.bytesize, body)
  end
end
Post.new("H€llo", "Wôrld").serialize("somedata".b) # => 'String#bytesplice': incompatible character encodings: BINARY (ASCII-8BIT) and UTF-8 (Encoding::CompatibilityError)
```
And even if it worked, it would be very unergonomic.
### Proposal: a `byteconcat` method
A solution to this would be to add a new `byteconcat` method, which could be shimmed as:
```ruby
class String
  def byteconcat(*strings)
    strings.map! do |s|
      if s.is_a?(String) && s.encoding != encoding
        s.dup.force_encoding(encoding)
      else
        s
      end
    end
    concat(*strings)
  end
end
Post = Struct.new(:title, :body) do
  def serialize(buf)
    buf.byteconcat(
      255, title.bytesize, title,
      255, body.bytesize, body,
    )
  end
end
Post.new("H€llo", "Wôrld").serialize("somedata".b) # => "somedata\xFF\aH\xE2\x82\xACllo\xFF\x06W\xC3\xB4rld" #<Encoding:ASCII-8BIT>
```
But of course a builtin implementation wouldn't need to dup the arguments.
Like other `byte*` methods, it's the responsibility of the caller to ensure the resulting string has a valid encoding, or
to deal with it if not.
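As a point of comparison, this is how an existing `byte*` method already leaves validity to the caller. A minimal sketch using `setbyte` and `scrub` only; it does not involve the proposed method.

```ruby
s = String.new("abc", encoding: Encoding::UTF_8)
s.setbyte(0, 0xFF)          # 0xFF can never appear in valid UTF-8
still_utf8 = s.encoding     # Encoding::UTF_8, the encoding tag is untouched
valid = s.valid_encoding?   # false

# One way to "deal with it": replace invalid bytes before treating the
# string as text.
cleaned = s.scrub("?")      # "?bc"
```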
### Method name and signature
#### Name
This proposal suggests `String#byteconcat`, to mirror `String#concat`, but other names are possible:
- `byteappend` (like `Array#append`)
- `bytepush` (like `Array#push`)
#### Signature
This proposal makes `byteconcat` accept either `String` or `Integer` (in char range) arguments like `concat`. I believe it makes sense for consistency and also because it's not uncommon for protocols to have some byte based segments, and Integers are more convenient there.
The proposed method also accepts variable arguments for consistency with `String#concat`, `Array#push`, and `Array#append`.
The proposed method returns self, like `concat` and others.
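For reference, the `String#concat` behaviour this signature mirrors can be sketched as follows. It shows existing `concat` only; the interesting detail is that an Integer argument is a raw byte on a binary receiver but a codepoint on a UTF-8 one.

```ruby
bin = "a".b
ret = bin.concat(255, "b")        # Integer and String arguments, variadic
returns_self = ret.equal?(bin)    # true
bin_bytes = bin.bytes             # [97, 255, 98]

# On a UTF-8 receiver the Integer is interpreted as a codepoint:
utf8 = String.new("a", encoding: Encoding::UTF_8)
utf8.concat(255)                  # appends U+00FF, encoded as two bytes
utf8_bytes = utf8.bytes           # [97, 195, 191]
```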
### YJIT consideration
I consulted @maximecb about this proposal, and according to her, accepting variable arguments makes it harder for YJIT to optimize.
I suspect consistency with other APIs trumps the performance consideration, but I think it's worth mentioning.
* [ruby-core:118825] [Ruby master Feature#20594] A new String method to append bytes while preserving encoding
2024-06-25 6:39 [ruby-core:118388] [Ruby master Feature#20594] A new String method to append bytes while preserving encoding byroot (Jean Boussier) via ruby-core
` (21 preceding siblings ...)
2024-08-07 16:22 ` [ruby-core:118804] " tenderlovemaking (Aaron Patterson) via ruby-core
@ 2024-08-10 1:30 ` mame (Yusuke Endoh) via ruby-core
2024-09-05 4:54 ` [ruby-core:119053] " matz (Yukihiro Matsumoto) via ruby-core
23 siblings, 0 replies; 25+ messages in thread
From: mame (Yusuke Endoh) via ruby-core @ 2024-08-10 1:30 UTC (permalink / raw)
To: ruby-core; +Cc: mame (Yusuke Endoh)
Issue #20594 has been updated by mame (Yusuke Endoh).
Alan's proposal looks good to me. I don't think it is Ruby-like to design such an ordinary API with extreme care for optimization.
In fact, the same design as Alan's proposal was discussed at the dev meeting (sorry it was not written in the meeting log). The one difference between it and Alan's is that an exception should be thrown for Integers other than 0..255 until we have a convincing use case. I personally think it is reasonably convincing that it is the same as `setbyte`, though.
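For reference, the `setbyte` behaviour mentioned here: an Integer outside 0..255 is not rejected, only its low 8 bits are stored. A minimal sketch that shows `setbyte` only and makes no claim about what the new method ends up doing.

```ruby
s = "abc".b

s.setbyte(0, 0x141)          # 0x141 & 0xFF == 0x41
first_byte = s.getbyte(0)    # 65, i.e. "A"

s.setbyte(1, -1)             # -1 & 0xFF == 0xFF
second_byte = s.getbyte(1)   # 255
```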
----------------------------------------
Feature #20594: A new String method to append bytes while preserving encoding
https://bugs.ruby-lang.org/issues/20594#change-109392
* Author: byroot (Jean Boussier)
* Status: Open
* Assignee: byroot (Jean Boussier)
----------------------------------------
* [ruby-core:119053] [Ruby master Feature#20594] A new String method to append bytes while preserving encoding
2024-06-25 6:39 [ruby-core:118388] [Ruby master Feature#20594] A new String method to append bytes while preserving encoding byroot (Jean Boussier) via ruby-core
` (22 preceding siblings ...)
2024-08-10 1:30 ` [ruby-core:118825] " mame (Yusuke Endoh) via ruby-core
@ 2024-09-05 4:54 ` matz (Yukihiro Matsumoto) via ruby-core
23 siblings, 0 replies; 25+ messages in thread
From: matz (Yukihiro Matsumoto) via ruby-core @ 2024-09-05 4:54 UTC (permalink / raw)
To: ruby-core; +Cc: matz (Yukihiro Matsumoto)
Issue #20594 has been updated by matz (Yukihiro Matsumoto).
`String#append_as_bytes` looks good to me too. Accepted.
Matz.
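For readers following along, a minimal sketch of the serializer from the issue description under the accepted name. It assumes `String#append_as_bytes` takes the same variadic String/Integer arguments as the proposed `byteconcat`; on Rubies that don't define the method, the sketch falls back to the `.b` cast from the description.

```ruby
Post = Struct.new(:title, :body) do
  def serialize(buf)
    if buf.respond_to?(:append_as_bytes)
      # Appends the raw bytes; the receiver's encoding is left alone.
      buf.append_as_bytes(
        255, title.bytesize, title,
        255, body.bytesize, body,
      )
    else
      # Fallback for older Rubies: cast each string to binary first.
      buf <<
        255 << title.bytesize << title.b <<
        255 << body.bytesize << body.b
    end
  end
end

out = Post.new("H\u20ACllo", "W\u00F4rld").serialize("somedata".b)
out.encoding   # ASCII-8BIT (BINARY) on either path
```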
----------------------------------------
Feature #20594: A new String method to append bytes while preserving encoding
https://bugs.ruby-lang.org/issues/20594#change-109630
* Author: byroot (Jean Boussier)
* Status: Open
* Assignee: byroot (Jean Boussier)
----------------------------------------
end of thread, other threads:[~2024-09-05 4:55 UTC | newest]
Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
2024-06-25 6:39 [ruby-core:118388] [Ruby master Feature#20594] A new String method to append bytes while preserving encoding byroot (Jean Boussier) via ruby-core
2024-06-25 15:33 ` [ruby-core:118389] " Eregon (Benoit Daloze) via ruby-core
2024-06-26 15:51 ` [ruby-core:118390] " maximecb (Maxime Chevalier-Boisvert) via ruby-core
2024-07-11 5:06 ` [ruby-core:118543] " matz (Yukihiro Matsumoto) via ruby-core
2024-07-11 5:18 ` [ruby-core:118545] " byroot (Jean Boussier) via ruby-core
2024-07-11 5:38 ` [ruby-core:118547] " byroot (Jean Boussier) via ruby-core
2024-07-11 8:43 ` [ruby-core:118554] " mame (Yusuke Endoh) via ruby-core
2024-07-11 8:45 ` [ruby-core:118555] " byroot (Jean Boussier) via ruby-core
2024-07-11 10:54 ` [ruby-core:118560] " Eregon (Benoit Daloze) via ruby-core
2024-07-11 12:49 ` [ruby-core:118562] " byroot (Jean Boussier) via ruby-core
2024-07-11 15:19 ` [ruby-core:118564] " Eregon (Benoit Daloze) via ruby-core
2024-07-12 0:53 ` [ruby-core:118576] " shugo (Shugo Maeda) via ruby-core
2024-07-18 19:15 ` [ruby-core:118632] " Dan0042 (Daniel DeLorme) via ruby-core
2024-07-18 20:48 ` [ruby-core:118633] " alanwu (Alan Wu) via ruby-core
2024-07-19 8:07 ` [ruby-core:118636] " byroot (Jean Boussier) via ruby-core
2024-07-26 6:20 ` [ruby-core:118690] " byroot (Jean Boussier) via ruby-core
2024-07-30 8:03 ` [ruby-core:118735] " duerst via ruby-core
2024-07-30 12:36 ` [ruby-core:118738] " Dan0042 (Daniel DeLorme) via ruby-core
2024-08-01 9:30 ` [ruby-core:118763] " matz (Yukihiro Matsumoto) via ruby-core
2024-08-01 9:35 ` [ruby-core:118766] " byroot (Jean Boussier) via ruby-core
2024-08-02 17:09 ` [ruby-core:118780] " alanwu (Alan Wu) via ruby-core
2024-08-06 12:15 ` [ruby-core:118799] " byroot (Jean Boussier) via ruby-core
2024-08-07 16:22 ` [ruby-core:118804] " tenderlovemaking (Aaron Patterson) via ruby-core
2024-08-10 1:30 ` [ruby-core:118825] " mame (Yusuke Endoh) via ruby-core
2024-09-05 4:54 ` [ruby-core:119053] " matz (Yukihiro Matsumoto) via ruby-core