ruby-core@ruby-lang.org archive (unofficial mirror)
 help / color / mirror / Atom feed
From: "rhenium (Kazuki Yamaguchi) via ruby-core" <ruby-core@ml.ruby-lang.org>
To: ruby-core@ml.ruby-lang.org
Cc: "rhenium (Kazuki Yamaguchi)" <noreply@ruby-lang.org>
Subject: [ruby-core:119836] [Ruby master Feature#20878] A new C API to create a String by adopting a pointer: `rb_enc_str_adopt(const char *ptr, long len, long capa, rb_encoding *enc)`
Date: Fri, 08 Nov 2024 08:56:52 +0000 (UTC)	[thread overview]
Message-ID: <redmine.journal-110524.20241108085652.7941@ruby-lang.org> (raw)
In-Reply-To: <redmine.issue-20878.20241107103140.7941@ruby-lang.org>

Issue #20878 has been updated by rhenium (Kazuki Yamaguchi).


byroot (Jean Boussier) wrote:
> #### Work inside RString allocated memory
> [...]
> The downside with this approach is that it contains a lot of inefficiencies, as `rb_str_set_len` will perform
> numerous safety checks, compute coderange, and write the string terminator on every invocation.

I thought `rb_str_set_len()` was supposed to be the efficient alternative to `rb_str_resize()` meant for such a purpose.

I think an assert on the capacity or filling the terminator is cheap enough that it won't matter. That it computes coderange is news to me - I found it was since commit commit:6b66b5fdedb2c9a9ee48e290d57ca7f8d55e01a2 / [Bug #19902] in 2023. I think correcting coderange after directly modifying the RString-managed buffer is the caller's responsibility. Perhaps it could be reversed?


----------------------------------------
Feature #20878: A new C API to create a String by adopting a pointer: `rb_enc_str_adopt(const char *ptr, long len, long capa, rb_encoding *enc)`
https://bugs.ruby-lang.org/issues/20878#change-110524

* Author: byroot (Jean Boussier)
* Status: Open
----------------------------------------
### Context

A common use case when writing C extensions is to generate text or bytes into a buffer, and to return it back
wrapped into a Ruby String. Examples are `JSON.generate(obj) -> String`, and all other format serializers,
compression libraries such as `ZLib.deflate`, etc, but also methods such as `Time.strftime`, 

### Current Solution

#### Work in a buffer and copy the result

The most often used solution is to work with a native buffer and to manage a native allocated buffer,
and once the generation is done, call `rb_str_new*` to copy the result inside memory managed by Ruby.

It works, but isn't very efficient because it cause an extra copy and an extra `free()`.

On `ruby/json` macro-benchmarks, this represent around 5% of the time spent in `JSON.generate`.

```c
static void fbuffer_free(FBuffer *fb)
{
    if (fb->ptr && fb->type == FBUFFER_HEAP_ALLOCATED) {
        ruby_xfree(fb->ptr);
    }
}

static VALUE fbuffer_to_s(FBuffer *fb)
{
    VALUE result = rb_utf8_str_new(FBUFFER_PTR(fb), FBUFFER_LEN(fb));
    fbuffer_free(fb);
    return result;
}
```

#### Work inside RString allocated memory

Another way this is currently done, is to allocate an `RString` using `rb_str_buf_new`,
and write into it with various functions such as `rb_str_catf`,
or writing past `RString.len` through `RSTRING_PTR` and then resize it with `rb_str_set_len`.

The downside with this approach is that it contains a lot of inefficiencies, as `rb_str_set_len` will perform
numerous safety checks, compute coderange, and write the string terminator on every invocation.

Another major inneficiency is that this API make it hard to be in control of the buffer
growth, so it can result in a lot more `realloc()` calls than manually managing the buffer.

This method is used by `Kernel#sprintf`, `Time#strftime` etc, and when I attempted to improve `Time#strftime`
performance, this problem showed up as the biggest bottleneck:

  - https://github.com/ruby/ruby/pull/11547
  - https://github.com/ruby/ruby/pull/11544
  - https://github.com/ruby/ruby/pull/11542

### Proposed API

I think a more effcient way to do this would be to work with a native buffer, and then build a RString
that "adopt" the memory region.

Technically, you can currently do this by reaching directly into `RString` members, but I don't think it's clean,
and a dedicated API would be preferable:

```c
/**
 * Similar to rb_str_new(), but it adopts the pointer instead of copying.
 *
 * @param[in]  ptr             A memory region of `capa` bytes length. MUST have been allocated with `ruby_xmalloc`
 * @param[in]  len             Length  of the string,  in bytes,  not including  the
 *                             terminating NUL character, not including extra capacity.
 * @param[in]  capa            The usable length of `ptr`, in bytes,  including  the
 *                             terminating NUL character.
 * @param[in]  enc             Encoding of `ptr`.
 * @exception  rb_eArgError    `len` is negative.
 * @return     An instance  of ::rb_cString,  of `len`  bytes length, `capa - 1` bytes capacity,
 *             and of `enc` encoding.
 * @pre        At  least  `capa` bytes  of  continuous  memory region  shall  be
 *             accessible via `ptr`.
 * @pre        `ptr` MUST have been allocated with `ruby_xmalloc`.
 * @pre        `ptr` MUST not be manually freed after `rb_enc_str_adopt` has been called.
 * @note       `enc` can be a  null pointer.  It can also be  seen as a routine
 *             identical to rb_usascii_str_new() then.
 */
rb_enc_str_adopt(const char *ptr, long len, long capa, rb_encoding *enc);
```

An alternative to the `adopt` term, could be `move`.




-- 
https://bugs.ruby-lang.org/
 ______________________________________________
 ruby-core mailing list -- ruby-core@ml.ruby-lang.org
 To unsubscribe send an email to ruby-core-leave@ml.ruby-lang.org
 ruby-core info -- https://ml.ruby-lang.org/mailman3/lists/ruby-core.ml.ruby-lang.org/

  parent reply	other threads:[~2024-11-08  8:57 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-11-07 10:31 [ruby-core:119801] " byroot (Jean Boussier) via ruby-core
2024-11-07 14:54 ` [ruby-core:119813] " Eregon (Benoit Daloze) via ruby-core
2024-11-07 15:38 ` [ruby-core:119815] " byroot (Jean Boussier) via ruby-core
2024-11-07 16:40 ` [ruby-core:119816] " nobu (Nobuyoshi Nakada) via ruby-core
2024-11-07 17:14 ` [ruby-core:119819] " byroot (Jean Boussier) via ruby-core
2024-11-08  0:02 ` [ruby-core:119828] " shyouhei (Shyouhei Urabe) via ruby-core
2024-11-08  3:20 ` [ruby-core:119830] " nobu (Nobuyoshi Nakada) via ruby-core
2024-11-08  7:53 ` [ruby-core:119834] " byroot (Jean Boussier) via ruby-core
2024-11-08  8:43 ` [ruby-core:119835] " shyouhei (Shyouhei Urabe) via ruby-core
2024-11-08  8:56 ` rhenium (Kazuki Yamaguchi) via ruby-core [this message]
2024-11-08 10:08 ` [ruby-core:119840] " byroot (Jean Boussier) via ruby-core
2024-11-08 15:47 ` [ruby-core:119847] " kddnewton (Kevin Newton) via ruby-core
2024-11-08 17:30 ` [ruby-core:119848] " mdalessio (Mike Dalessio) via ruby-core
2024-11-21 17:40 ` [ruby-core:119982] " byroot (Jean Boussier) via ruby-core
2024-11-22  8:49 ` [ruby-core:119989] " nobu (Nobuyoshi Nakada) via ruby-core
2024-11-22  8:50 ` [ruby-core:119990] " byroot (Jean Boussier) via ruby-core

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=redmine.journal-110524.20241108085652.7941@ruby-lang.org \
    --to=ruby-core@ml.ruby-lang.org \
    --cc=noreply@ruby-lang.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).