ruby-core@ruby-lang.org archive (unofficial mirror)
* [ruby-core:119000] [Ruby master Bug#20710] Reducing Hash allocation introduces large performance degradation (probably related to VWA)
@ 2024-09-02  6:24 pocke (Masataka Kuwabara) via ruby-core
  2024-09-02 16:49 ` [ruby-core:119015] " peterzhu2118 (Peter Zhu) via ruby-core
                   ` (4 more replies)
  0 siblings, 5 replies; 6+ messages in thread
From: pocke (Masataka Kuwabara) via ruby-core @ 2024-09-02  6:24 UTC (permalink / raw)
  To: ruby-core; +Cc: pocke (Masataka Kuwabara)

Issue #20710 has been reported by pocke (Masataka Kuwabara).

----------------------------------------
Bug #20710: Reducing Hash allocation introduces large performance degradation (probably related to VWA)
https://bugs.ruby-lang.org/issues/20710

* Author: pocke (Masataka Kuwabara)
* Status: Open
* ruby -v: ruby 3.3.4 (2024-07-09 revision be1089c8ec) [arm64-darwin21]
* Backport: 3.1: UNKNOWN, 3.2: UNKNOWN, 3.3: UNKNOWN
----------------------------------------
I found a surprising performance degradation while developing RBS.
In short, I tried to remove unnecessary Hash allocations from RBS, and that made the execution time about 2x slower.

VWA for Hash probably causes this degradation. I'd be happy if we could mitigate the impact by updating the memory management strategy.


## Reproduce

You can reproduce this problem with a PR in the pocke/rbs repository.
https://github.com/pocke/rbs/pull/2
This PR deduplicates empty Hash objects.

1. `git clone` and checkout
1. `bundle install`
1. `bundle exec rake compile` for C-ext
1. `bundle exec ruby benchmark/benchmark_new_env.rb`

The "before" commit is https://github.com/pocke/rbs/commit/2c356c060286429cfdb034f88a74a6f94420fd21.
The "after" commit is https://github.com/pocke/rbs/commit/bfb2c367c7d3b7f93720392252d3a3980d7bf335.

The benchmark results are the following:

```
# Before
$ bundle exec ruby benchmark/benchmark_new_env.rb
(snip)
             new_env      6.426 (±15.6%) i/s -     64.000 in  10.125442s
       new_rails_env      0.968 (± 0.0%) i/s -     10.000 in  10.355738s

# After
$ bundle exec ruby benchmark/benchmark_new_env.rb
(snip)
             new_env      4.371 (±22.9%) i/s -     43.000 in  10.150192s
       new_rails_env      0.360 (± 0.0%) i/s -      4.000 in  11.313158s
```

The IPS decreased by a factor of 1.47 for the `new_env` case (parsing a small RBS env), and by a factor of 2.69 for `new_rails_env` (parsing a large RBS env).
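
The factors quoted above follow directly from the reported iterations per second. A minimal sketch of the arithmetic, using only the numbers from the output above (the constant and method names are illustrative, not part of RBS):

```ruby
# Iterations per second copied from the benchmark output above.
RESULTS = {
  "new_env"       => { before: 6.426, after: 4.371 },
  "new_rails_env" => { before: 0.968, after: 0.360 },
}.freeze

# Slowdown factor: how many times fewer iterations per second
# were measured after the change.
def slowdown(before:, after:)
  (before / after).round(2)
end

RESULTS.each do |name, ips|
  puts format("%-14s %.2fx slower", name, slowdown(**ips))
end
```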


## Investigation

### GC.stat

`GC.stat` shows that the number of minor GCs more than doubles (`minor_gc_count`: 112 → 242) and that total GC time rises (`time`: 541 → 1551).

```ruby
# In the RBS repository
require_relative './benchmark/utils'

tmpdir = prepare_collection!
new_rails_env(tmpdir)
pp GC.stat
```


```
# before
{:count=>126,
 :time=>541,
 :marking_time=>496,
 :sweeping_time=>45,
 :heap_allocated_pages=>702,
 :heap_sorted_length=>984,
 :heap_allocatable_pages=>282,
 :heap_available_slots=>793270,
 :heap_live_slots=>787407,
 :heap_free_slots=>5863,
 :heap_final_slots=>0,
 :heap_marked_slots=>757744,
 :heap_eden_pages=>702,
 :heap_tomb_pages=>0,
 :total_allocated_pages=>702,
 :total_freed_pages=>0,
 :total_allocated_objects=>2220605,
 :total_freed_objects=>1433198,
 :malloc_increase_bytes=>5872,
 :malloc_increase_bytes_limit=>16777216,
 :minor_gc_count=>112,
 :major_gc_count=>14,
 :compact_count=>0,
 :read_barrier_faults=>0,
 :total_moved_objects=>0,
 :remembered_wb_unprotected_objects=>0,
 :remembered_wb_unprotected_objects_limit=>4779,
 :old_objects=>615704,
 :old_objects_limit=>955872,
 :oldmalloc_increase_bytes=>210912,
 :oldmalloc_increase_bytes_limit=>16777216}

# after
{:count=>255,
 :time=>1551,
 :marking_time=>1496,
 :sweeping_time=>55,
 :heap_allocated_pages=>570,
 :heap_sorted_length=>1038,
 :heap_allocatable_pages=>468,
 :heap_available_slots=>735520,
 :heap_live_slots=>731712,
 :heap_free_slots=>3808,
 :heap_final_slots=>0,
 :heap_marked_slots=>728727,
 :heap_eden_pages=>570,
 :heap_tomb_pages=>0,
 :total_allocated_pages=>570,
 :total_freed_pages=>0,
 :total_allocated_objects=>2183278,
 :total_freed_objects=>1451566,
 :malloc_increase_bytes=>1200,
 :malloc_increase_bytes_limit=>16777216,
 :minor_gc_count=>242,
 :major_gc_count=>13,
 :compact_count=>0,
 :read_barrier_faults=>0,
 :total_moved_objects=>0,
 :remembered_wb_unprotected_objects=>0,
 :remembered_wb_unprotected_objects_limit=>5915,
 :old_objects=>600594,
 :old_objects_limit=>1183070,
 :oldmalloc_increase_bytes=>8128,
 :oldmalloc_increase_bytes_limit=>16777216}
```
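
Comparing two full `GC.stat` dumps by eye is error-prone, so it may help to measure only the counters that changed while the workload ran. The following is a rough sketch, not the script used for the numbers above; `gc_stat_delta` and `GC_KEYS` are hypothetical names, and the block body is a stand-in for `new_rails_env(tmpdir)`:

```ruby
# Counters worth comparing between the "before" and "after" commits.
GC_KEYS = %i[count minor_gc_count major_gc_count time total_allocated_objects].freeze

# Runs the block and returns how much each counter grew while it ran.
# Keys missing on older Rubies (e.g. :time) are treated as 0.
def gc_stat_delta
  before = GC.stat
  yield
  after = GC.stat
  GC_KEYS.to_h { |key| [key, after.fetch(key, 0) - before.fetch(key, 0)] }
end

# Stand-in workload; in the RBS repository this would be
# `new_rails_env(tmpdir)` after `prepare_collection!`.
delta = gc_stat_delta { 200_000.times.map { {} } }
pp delta
```

Running this once on each commit gives per-run deltas that can be compared directly.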

### Warming up Hashes

The following patch, which creates unnecessary Hash objects before the benchmark, improves the execution time.


```diff
diff --git a/benchmark/benchmark_new_env.rb b/benchmark/benchmark_new_env.rb
index 6dd2b73f..a8da61c6 100644
--- a/benchmark/benchmark_new_env.rb
+++ b/benchmark/benchmark_new_env.rb
@@ -4,6 +4,8 @@ require 'benchmark/ips'
 
 tmpdir = prepare_collection!
 
+(0..30_000_000).map { {} }
+
 Benchmark.ips do |x|
   x.time = 10
```


The results are the following:

```
# Before
Calculating -------------------------------------
             new_env     10.354 (± 9.7%) i/s -    103.000 in  10.013834s
       new_rails_env      1.661 (± 0.0%) i/s -     17.000 in  10.282490s

# After
Calculating -------------------------------------
             new_env     10.771 (± 9.3%) i/s -    107.000 in  10.010446s
       new_rails_env      1.584 (± 0.0%) i/s -     16.000 in  10.178984s
```


### `RUBY_GC_HEAP_FREE_SLOTS_MIN_RATIO`

Setting the `RUBY_GC_HEAP_FREE_SLOTS_MIN_RATIO` env var also mitigates the performance impact.
In this example, I set `RUBY_GC_HEAP_FREE_SLOTS_MIN_RATIO=0.6` (default: 0.20).

```console
# Before
Calculating -------------------------------------
             new_env     10.271 (± 9.7%) i/s -    102.000 in  10.087191s
       new_rails_env      1.529 (± 0.0%) i/s -     16.000 in  10.538043s

# After
$ env RUBY_GC_HEAP_FREE_SLOTS_MIN_RATIO=0.6 bundle exec ruby benchmark/benchmark_new_env.rb
Calculating -------------------------------------
             new_env     11.003 (± 9.1%) i/s -    110.000 in  10.068428s
       new_rails_env      1.347 (± 0.0%) i/s -     14.000 in  11.117665s
```


## Additional Information

* I applied the same change to Array, but it does not cause this problem.
  * I guess the cause is the difference in size pools. An empty Array uses 40 bytes, like an ordinary Ruby object, but an empty Hash uses 160 bytes.
  * The size pool for 160-byte objects holds fewer objects than the 40-byte one, so performance is more sensitive to a reduction in allocations there.
* I tried it on Ruby 3.2. The same change does not degrade the execution time there.
  * VWA for Hash was introduced in Ruby 3.3. https://github.com/ruby/ruby/blob/73c39a5f93d3ad4514a06158e2bb7622496372b9/doc/NEWS/NEWS-3.3.0.md#gc--memory-management
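
The object sizes mentioned in the list above can be checked on a given Ruby build. This is a small sketch, assuming `ObjectSpace.memsize_of` reflects the slot size on the running interpreter and that `GC.stat_heap` is available (Ruby 3.1 or later). The printed values depend on the Ruby version and platform, so none are shown here:

```ruby
require "objspace"

# Bytes reported for an empty Array and an empty Hash on the running Ruby.
# The report above observed 40 and 160 bytes on Ruby 3.3; other versions
# and platforms may differ.
empty_sizes = {
  array: ObjectSpace.memsize_of([]),
  hash:  ObjectSpace.memsize_of({}),
}
pp empty_sizes

# GC.stat_heap reports statistics per size pool, which shows how many
# slots each pool currently has. Keys that do not exist on a given
# version simply print as empty.
if GC.respond_to?(:stat_heap)
  GC.stat_heap.each do |pool, stats|
    puts "pool #{pool}: slot_size=#{stats[:slot_size]} eden_slots=#{stats[:heap_eden_slots]}"
  end
end
```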



## Acknowledgement

@mame, @ko1, and @soutaro helped with the investigation. I would like to thank them.



-- 
https://bugs.ruby-lang.org/
 ______________________________________________
 ruby-core mailing list -- ruby-core@ml.ruby-lang.org
 To unsubscribe send an email to ruby-core-leave@ml.ruby-lang.org
 ruby-core info -- https://ml.ruby-lang.org/mailman3/lists/ruby-core.ml.ruby-lang.org/


* [ruby-core:119015] [Ruby master Bug#20710] Reducing Hash allocation introduces large performance degradation (probably related to VWA)
  2024-09-02  6:24 [ruby-core:119000] [Ruby master Bug#20710] Reducing Hash allocation introduces large performance degradation (probably related to VWA) pocke (Masataka Kuwabara) via ruby-core
@ 2024-09-02 16:49 ` peterzhu2118 (Peter Zhu) via ruby-core
  2024-09-02 16:53 ` [ruby-core:119016] " byroot (Jean Boussier) via ruby-core
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: peterzhu2118 (Peter Zhu) via ruby-core @ 2024-09-02 16:49 UTC (permalink / raw)
  To: ruby-core; +Cc: peterzhu2118 (Peter Zhu)

Issue #20710 has been updated by peterzhu2118 (Peter Zhu).


Thank you for the detailed report.

I had a discussion about this with @mame and @byroot last year in the Ruby core slack: https://ruby.slack.com/archives/C02A3SL0S/p1702910003614609

One of the big differences is that VWA allocates the AR/ST table for the hash eagerly rather than lazily. This means there is a performance penalty for empty hashes, but the penalty is recovered once the hash has elements.

Another issue with microbenchmarks is that some of the size pools may be small since there aren't many objects, so GC may trigger more frequently. We don't see this kind of issue with macrobenchmarks since they keep more objects alive after bootup. I can see that you've observed this issue in your benchmark as well.

In the following microbenchmark, we can see that for hashes of up to 8 elements (i.e. AR hashes), the performance in Ruby 3.3 is basically on par with Ruby 3.2, but when we switch to ST tables (9 elements), VWA is significantly faster (about 15.3s versus 20.5s real time):

```
ruby 3.2.4 (2024-04-23 revision af471c0e01) [arm64-darwin23]
       user     system      total        real
Hash with 0 elements  2.421699   0.013686   2.435385 (  2.436581)
Hash with 1 elements  2.955857   0.029542   2.985399 (  3.014737)
Hash with 2 elements  2.891668   0.019301   2.910969 (  2.921928)
Hash with 3 elements  2.900170   0.015396   2.915566 (  2.916644)
Hash with 4 elements  2.889895   0.014969   2.904864 (  2.905188)
Hash with 5 elements  2.895059   0.017253   2.912312 (  2.912845)
Hash with 6 elements  2.869016   0.014351   2.883367 (  2.883618)
Hash with 7 elements  2.907134   0.016862   2.923996 (  2.924871)
Hash with 8 elements  2.926749   0.020445   2.947194 (  2.956753)
Hash with 9 elements 19.932546   0.551577  20.484123 ( 20.498173)


ruby 3.3.3 (2024-06-12 revision f1c7b6f435) [arm64-darwin23]
       user     system      total        real
Hash with 0 elements  2.591444   0.023060   2.614504 (  2.616658)
Hash with 1 elements  3.052488   0.030433   3.082921 (  3.102709)
Hash with 2 elements  3.064385   0.027627   3.092012 (  3.106096)
Hash with 3 elements  3.036935   0.023353   3.060288 (  3.063819)
Hash with 4 elements  3.020218   0.022274   3.042492 (  3.043182)
Hash with 5 elements  3.053680   0.025551   3.079231 (  3.083070)
Hash with 6 elements  2.991555   0.023347   3.014902 (  3.017601)
Hash with 7 elements  3.011856   0.026142   3.037998 (  3.041611)
Hash with 8 elements  3.044671   0.033276   3.077947 (  3.109949)
Hash with 9 elements 14.873814   0.400856  15.274670 ( 15.309215)
```


```ruby
require "benchmark"

TIMES = 100_000_000

Benchmark.bm do |x|
  x.report("Hash with 0 elements") do
    TIMES.times { {} }
  end

  x.report("Hash with 1 elements") do
    TIMES.times { { a: 0 } }
  end

  x.report("Hash with 2 elements") do
    TIMES.times { { a: 0, b: 0 } }
  end

  x.report("Hash with 3 elements") do
    TIMES.times { { a: 0, b: 0, c: 0 } }
  end

  x.report("Hash with 4 elements") do
    TIMES.times { { a: 0, b: 0, c: 0, d: 0 } }
  end

  x.report("Hash with 5 elements") do
    TIMES.times { { a: 0, b: 0, c: 0, d: 0, e: 0 } }
  end

  x.report("Hash with 6 elements") do
    TIMES.times { { a: 0, b: 0, c: 0, d: 0, e: 0, f: 0 } }
  end

  x.report("Hash with 7 elements") do
    TIMES.times { { a: 0, b: 0, c: 0, d: 0, e: 0, f: 0, g: 0 } }
  end

  x.report("Hash with 8 elements") do
    TIMES.times { { a: 0, b: 0, c: 0, d: 0, e: 0, f: 0, g: 0, h: 0 } }
  end

  x.report("Hash with 9 elements") do
    TIMES.times { { a: 0, b: 0, c: 0, d: 0, e: 0, f: 0, g: 0, h: 0, i: 0 } }
  end
end
```


----------------------------------------
Bug #20710: Reducing Hash allocation introduces large performance degradation (probably related to VWA)
https://bugs.ruby-lang.org/issues/20710#change-109590


* [ruby-core:119016] [Ruby master Bug#20710] Reducing Hash allocation introduces large performance degradation (probably related to VWA)
  2024-09-02  6:24 [ruby-core:119000] [Ruby master Bug#20710] Reducing Hash allocation introduces large performance degradation (probably related to VWA) pocke (Masataka Kuwabara) via ruby-core
  2024-09-02 16:49 ` [ruby-core:119015] " peterzhu2118 (Peter Zhu) via ruby-core
@ 2024-09-02 16:53 ` byroot (Jean Boussier) via ruby-core
  2024-09-04  3:28 ` [ruby-core:119032] " mame (Yusuke Endoh) via ruby-core
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: byroot (Jean Boussier) via ruby-core @ 2024-09-02 16:53 UTC (permalink / raw)
  To: ruby-core; +Cc: byroot (Jean Boussier)

Issue #20710 has been updated by byroot (Jean Boussier).


I still think free pages should be in a global pool rather than tied to a specific size pool. I believe that would solve this issue.

And yes, we don't see it in macro benchmarks, but it might still cause more frequent GC than necessary.

----------------------------------------
Bug #20710: Reducing Hash allocation introduces large performance degradation (probably related to VWA)
https://bugs.ruby-lang.org/issues/20710#change-109591


* [ruby-core:119032] [Ruby master Bug#20710] Reducing Hash allocation introduces large performance degradation (probably related to VWA)
  2024-09-02  6:24 [ruby-core:119000] [Ruby master Bug#20710] Reducing Hash allocation introduces large performance degradation (probably related to VWA) pocke (Masataka Kuwabara) via ruby-core
  2024-09-02 16:49 ` [ruby-core:119015] " peterzhu2118 (Peter Zhu) via ruby-core
  2024-09-02 16:53 ` [ruby-core:119016] " byroot (Jean Boussier) via ruby-core
@ 2024-09-04  3:28 ` mame (Yusuke Endoh) via ruby-core
  2024-09-06 21:50 ` [ruby-core:119092] " peterzhu2118 (Peter Zhu) via ruby-core
  2024-09-09  7:25 ` [ruby-core:119103] " pocke (Masataka Kuwabara) via ruby-core
  4 siblings, 0 replies; 6+ messages in thread
From: mame (Yusuke Endoh) via ruby-core @ 2024-09-04  3:28 UTC (permalink / raw)
  To: ruby-core; +Cc: mame (Yusuke Endoh)

Issue #20710 has been updated by mame (Yusuke Endoh).


@peterzhu2118 This is totally different from the issue we talked about in Slack, which was entirely a micro-benchmark of the speed of generating empty hashes. This problem is much more complex.

The problem is that “reducing the number of objects created can slow down execution”. If my investigation is correct, the mechanism of the problem is as follows.

* If a program creates more objects, GC occurs often and the heap grows. This gradually decreases the frequency of GC, and throughput increases.
* If a program creates fewer objects, only minor GCs occur and the heap does not grow. Therefore, minor GCs continue to occur at a high frequency and throughput does not increase.

Therefore, allocating unnecessary objects can be much faster, ironically.

If my understanding is correct, minor GC grows the heap only the first few times, and no matter how many minor GCs occur after that, the heap will not grow. I feel there is room for improvement in this.
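
One way to check this hypothesis is to record the heap size each time a GC run is observed and see whether it keeps growing. The following is only a sketch under that assumption; `trace_heap_growth` is a hypothetical helper, and the block is a stand-in workload of short-lived hashes rather than the RBS benchmark:

```ruby
# Repeats the block and records heap statistics whenever the GC count
# has changed since the previous iteration.
def trace_heap_growth(iterations)
  samples = []
  last_count = GC.count
  iterations.times do
    yield
    next if GC.count == last_count

    stat = GC.stat
    last_count = stat[:count]
    samples << {
      gc_count: stat[:count],
      minor: stat[:minor_gc_count],
      major: stat[:major_gc_count],
      available_slots: stat[:heap_available_slots],
    }
  end
  samples
end

# Stand-in workload: short-lived empty hashes only.
samples = trace_heap_growth(2_000) { Array.new(1_000) { {} } }
samples.last(5).each { |sample| p sample }
```

If `available_slots` stays flat while `minor` keeps climbing, that would be consistent with the behavior described above. It would not by itself show which size pool is responsible.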

I am not even sure that VWA is really involved in this problem. But we have observed this problem with Hash objects on Ruby 3.3, and it does not reproduce on Ruby 3.2. We couldn't reproduce it by reducing allocations of objects other than Hash.

Note that this problem is not in a micro-benchmark. It actually occurs in RBS + Steep, a real-world macro benchmark.

----------------------------------------
Bug #20710: Reducing Hash allocation introduces large performance degradation (probably related to VWA)
https://bugs.ruby-lang.org/issues/20710#change-109611

* Author: pocke (Masataka Kuwabara)
* Status: Open
* ruby -v: ruby 3.3.4 (2024-07-09 revision be1089c8ec) [arm64-darwin21]
* Backport: 3.1: UNKNOWN, 3.2: UNKNOWN, 3.3: UNKNOWN
----------------------------------------
I found a surprising performance degradation while developing RBS.
In short, I tried to remove unnecessary Hash allocations for RBS. Then, it made the execution time 2x slower.

VWA for Hash probably causes this degradation. I'd be happy if we could mitigate the impact by updating the memory management strategy.


## Reproduce

You can reproduce this problem on a PR in pocke/rbs repository.
https://github.com/pocke/rbs/pull/2
This PR dedups empty Hash objects.

1. `git clone` and checkout
1. `bundle install`
1. `bundle exec rake compile` for C-ext
1. `bundle ruby benchmark/benchmark_new_env.rb`

The "before" commit is https://github.com/pocke/rbs/commit/2c356c060286429cfdb034f88a74a6f94420fd21.
The "after" commit is https://github.com/pocke/rbs/commit/bfb2c367c7d3b7f93720392252d3a3980d7bf335.

The benchmark results are the following:

```
# Before
$ bundle exec ruby benchmark/benchmark_new_env.rb
(snip)
             new_env      6.426 (±15.6%) i/s -     64.000 in  10.125442s
       new_rails_env      0.968 (± 0.0%) i/s -     10.000 in  10.355738s

# After
$ bundle exec ruby benchmark/benchmark_new_env.rb
(snip)
             new_env      4.371 (±22.9%) i/s -     43.000 in  10.150192s
       new_rails_env      0.360 (± 0.0%) i/s -      4.000 in  11.313158s
```

The IPS decreased 1.47x for `new_env` case (parsing small RBS env), and 2.69x for `new_rails_env` (parsing large RBS env).


## Investigation

### GC.stat

`GC.stat` indicates the number of minor GCs increases.

```ruby
# In the RBS repository
require_relative './benchmark/utils'

tmpdir = prepare_collection!
new_rails_env(tmpdir)
pp GC.stat
```


```
# before
{:count=>126,
 :time=>541,
 :marking_time=>496,
 :sweeping_time=>45,
 :heap_allocated_pages=>702,
 :heap_sorted_length=>984,
 :heap_allocatable_pages=>282,
 :heap_available_slots=>793270,
 :heap_live_slots=>787407,
 :heap_free_slots=>5863,
 :heap_final_slots=>0,
 :heap_marked_slots=>757744,
 :heap_eden_pages=>702,
 :heap_tomb_pages=>0,
 :total_allocated_pages=>702,
 :total_freed_pages=>0,
 :total_allocated_objects=>2220605,
 :total_freed_objects=>1433198,
 :malloc_increase_bytes=>5872,
 :malloc_increase_bytes_limit=>16777216,
 :minor_gc_count=>112,
 :major_gc_count=>14,
 :compact_count=>0,
 :read_barrier_faults=>0,
 :total_moved_objects=>0,
 :remembered_wb_unprotected_objects=>0,
 :remembered_wb_unprotected_objects_limit=>4779,
 :old_objects=>615704,
 :old_objects_limit=>955872,
 :oldmalloc_increase_bytes=>210912,
 :oldmalloc_increase_bytes_limit=>16777216}

# after
{:count=>255,
 :time=>1551,
 :marking_time=>1496,
 :sweeping_time=>55,
 :heap_allocated_pages=>570,
 :heap_sorted_length=>1038,
 :heap_allocatable_pages=>468,
 :heap_available_slots=>735520,
 :heap_live_slots=>731712,
 :heap_free_slots=>3808,
 :heap_final_slots=>0,
 :heap_marked_slots=>728727,
 :heap_eden_pages=>570,
 :heap_tomb_pages=>0,
 :total_allocated_pages=>570,
 :total_freed_pages=>0,
 :total_allocated_objects=>2183278,
 :total_freed_objects=>1451566,
 :malloc_increase_bytes=>1200,
 :malloc_increase_bytes_limit=>16777216,
 :minor_gc_count=>242,
 :major_gc_count=>13,
 :compact_count=>0,
 :read_barrier_faults=>0,
 :total_moved_objects=>0,
 :remembered_wb_unprotected_objects=>0,
 :remembered_wb_unprotected_objects_limit=>5915,
 :old_objects=>600594,
 :old_objects_limit=>1183070,
 :oldmalloc_increase_bytes=>8128,
 :oldmalloc_increase_bytes_limit=>16777216}
```

### Warming up Hashes

The following patch, which creates unnecessary Hash objects before the benchmark, improves the execution time.


```diff
diff --git a/benchmark/benchmark_new_env.rb b/benchmark/benchmark_new_env.rb
index 6dd2b73f..a8da61c6 100644
--- a/benchmark/benchmark_new_env.rb
+++ b/benchmark/benchmark_new_env.rb
@@ -4,6 +4,8 @@ require 'benchmark/ips'
 
 tmpdir = prepare_collection!
 
+(0..30_000_000).map { {} }
+
 Benchmark.ips do |x|
   x.time = 10
```


The results are the following:

```
# Before
Calculating -------------------------------------
             new_env     10.354 (± 9.7%) i/s -    103.000 in  10.013834s
       new_rails_env      1.661 (± 0.0%) i/s -     17.000 in  10.282490s

# After
Calculating -------------------------------------
             new_env     10.771 (± 9.3%) i/s -    107.000 in  10.010446s
       new_rails_env      1.584 (± 0.0%) i/s -     16.000 in  10.178984s
```
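
A rough way to observe what such a warm-up does to the heap is to compare `GC.stat_heap` (available since Ruby 3.1) before and after allocating many empty Hashes. This is only a sketch: the helper name `eden_slots_by_slot_size` is hypothetical, the count of 200,000 is arbitrary and much smaller than in the patch, and the Hashes are deliberately retained here so that the heap has to grow.

```ruby
# Hypothetical helper: maps each heap's slot size (bytes) to its number of eden slots.
def eden_slots_by_slot_size
  GC.stat_heap.to_h { |_index, stats| [stats[:slot_size], stats[:heap_eden_slots]] }
end

before = eden_slots_by_slot_size
warmup = Array.new(200_000) { {} } # retained, so the slots stay occupied
after = eden_slots_by_slot_size

after.each do |slot_size, slots|
  puts format("%4d-byte slots: %8d -> %8d", slot_size, before[slot_size], slots)
end
```

Which slot size grows depends on the Ruby version, since it is determined by where an empty Hash is allocated.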


### `RUBY_GC_HEAP_FREE_SLOTS_MIN_RATIO`

Setting the `RUBY_GC_HEAP_FREE_SLOTS_MIN_RATIO` environment variable also mitigates the performance impact.
In this example, I set `RUBY_GC_HEAP_FREE_SLOTS_MIN_RATIO=0.6` (default: 0.20).

```console
# Before
Calculating -------------------------------------
             new_env     10.271 (± 9.7%) i/s -    102.000 in  10.087191s
       new_rails_env      1.529 (± 0.0%) i/s -     16.000 in  10.538043s

# After
$ env RUBY_GC_HEAP_FREE_SLOTS_MIN_RATIO=0.6 bundle exec ruby benchmark/benchmark_new_env.rb
Calculating -------------------------------------
             new_env     11.003 (± 9.1%) i/s -    110.000 in  10.068428s
       new_rails_env      1.347 (± 0.0%) i/s -     14.000 in  11.117665s
```
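
The effect of the variable can also be checked on a small scale by running a child Ruby process with and without it and comparing the minor GC counts. This is a minimal sketch under stated assumptions: `WORKLOAD` is a toy script, not the RBS benchmark, and the helper `minor_gcs_with` is hypothetical. The resulting numbers depend on the Ruby version and platform.

```ruby
require "open3"
require "rbconfig"

# Toy workload (an assumption, not the RBS benchmark): retain many small Hashes,
# then print the number of minor GCs that ran.
WORKLOAD = 'a = Array.new(300_000) { { k: 1 } }; print GC.stat(:minor_gc_count)'

# Runs WORKLOAD in a child Ruby; a nil ratio leaves the default setting in place.
def minor_gcs_with(ratio)
  env = ratio ? { "RUBY_GC_HEAP_FREE_SLOTS_MIN_RATIO" => ratio.to_s } : {}
  out, status = Open3.capture2(env, RbConfig.ruby, "--disable-gems", "-e", WORKLOAD)
  raise "child ruby failed" unless status.success?
  Integer(out)
end

[nil, 0.6].each do |ratio|
  puts "min ratio #{ratio || 'default'}: #{minor_gcs_with(ratio)} minor GCs"
end
```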


## Additional Information

* I applied the same change to Array, but it does not cause this problem.
  * I guess the cause is the difference in Size Pools. An empty Array uses 40 bytes, like an ordinary Ruby object, but an empty Hash uses 160 bytes.
  * The Size Pool for 160-byte objects holds fewer objects than the 40-byte one, so reducing allocations there affects performance more sensitively.
* I tried it on Ruby 3.2. There, this change does not degrade the execution time.
  * VWA for Hash was introduced in Ruby 3.3. https://github.com/ruby/ruby/blob/73c39a5f93d3ad4514a06158e2bb7622496372b9/doc/NEWS/NEWS-3.3.0.md#gc--memory-management
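
The sizes mentioned above can be checked on a given Ruby build with `ObjectSpace.memsize_of` from the `objspace` standard library. This is only a quick sketch; the reported values depend on the Ruby version and platform, so on Ruby 3.2 the two sizes may not differ in the way described for 3.3.

```ruby
require "objspace"

# Memory size in bytes reported for an empty Array and an empty Hash.
sizes = { "empty Array" => [], "empty Hash" => {} }.transform_values do |obj|
  ObjectSpace.memsize_of(obj)
end

sizes.each { |label, bytes| puts "#{label}: #{bytes} bytes" }
```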



## Acknowledgement

@mame, @ko1, and @soutaro helped with the investigation. I would like to thank them.



-- 
https://bugs.ruby-lang.org/
 ______________________________________________
 ruby-core mailing list -- ruby-core@ml.ruby-lang.org
 To unsubscribe send an email to ruby-core-leave@ml.ruby-lang.org
 ruby-core info -- https://ml.ruby-lang.org/mailman3/lists/ruby-core.ml.ruby-lang.org/


* [ruby-core:119092] [Ruby master Bug#20710] Reducing Hash allocation introduces large performance degradation (probably related to VWA)
  2024-09-02  6:24 [ruby-core:119000] [Ruby master Bug#20710] Reducing Hash allocation introduces large performance degradation (probably related to VWA) pocke (Masataka Kuwabara) via ruby-core
                   ` (2 preceding siblings ...)
  2024-09-04  3:28 ` [ruby-core:119032] " mame (Yusuke Endoh) via ruby-core
@ 2024-09-06 21:50 ` peterzhu2118 (Peter Zhu) via ruby-core
  2024-09-09  7:25 ` [ruby-core:119103] " pocke (Masataka Kuwabara) via ruby-core
  4 siblings, 0 replies; 6+ messages in thread
From: peterzhu2118 (Peter Zhu) via ruby-core @ 2024-09-06 21:50 UTC (permalink / raw)
  To: ruby-core; +Cc: peterzhu2118 (Peter Zhu)

Issue #20710 has been updated by peterzhu2118 (Peter Zhu).


I implemented @byroot's suggestion in this PR: https://github.com/ruby/ruby/pull/11562

It significantly improves the performance of your benchmark and makes it almost as fast as on Ruby 3.2.

----------------------------------------
Bug #20710: Reducing Hash allocation introduces large performance degradation (probably related to VWA)
https://bugs.ruby-lang.org/issues/20710#change-109679

* Author: pocke (Masataka Kuwabara)
* Status: Open
* ruby -v: ruby 3.3.4 (2024-07-09 revision be1089c8ec) [arm64-darwin21]
* Backport: 3.1: UNKNOWN, 3.2: UNKNOWN, 3.3: UNKNOWN
----------------------------------------


* [ruby-core:119103] [Ruby master Bug#20710] Reducing Hash allocation introduces large performance degradation (probably related to VWA)
  2024-09-02  6:24 [ruby-core:119000] [Ruby master Bug#20710] Reducing Hash allocation introduces large performance degradation (probably related to VWA) pocke (Masataka Kuwabara) via ruby-core
                   ` (3 preceding siblings ...)
  2024-09-06 21:50 ` [ruby-core:119092] " peterzhu2118 (Peter Zhu) via ruby-core
@ 2024-09-09  7:25 ` pocke (Masataka Kuwabara) via ruby-core
  4 siblings, 0 replies; 6+ messages in thread
From: pocke (Masataka Kuwabara) via ruby-core @ 2024-09-09  7:25 UTC (permalink / raw)
  To: ruby-core; +Cc: pocke (Masataka Kuwabara)

Issue #20710 has been updated by pocke (Masataka Kuwabara).


@peterzhu2118 Thanks for your work! I've confirmed this PR improves the performance in my environment too.

----------------------------------------
Bug #20710: Reducing Hash allocation introduces large performance degradation (probably related to VWA)
https://bugs.ruby-lang.org/issues/20710#change-109691

* Author: pocke (Masataka Kuwabara)
* Status: Open
* ruby -v: ruby 3.3.4 (2024-07-09 revision be1089c8ec) [arm64-darwin21]
* Backport: 3.1: UNKNOWN, 3.2: UNKNOWN, 3.3: UNKNOWN
----------------------------------------


end of thread, other threads:[~2024-09-09  7:26 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-09-02  6:24 [ruby-core:119000] [Ruby master Bug#20710] Reducing Hash allocation introduces large performance degradation (probably related to VWA) pocke (Masataka Kuwabara) via ruby-core
2024-09-02 16:49 ` [ruby-core:119015] " peterzhu2118 (Peter Zhu) via ruby-core
2024-09-02 16:53 ` [ruby-core:119016] " byroot (Jean Boussier) via ruby-core
2024-09-04  3:28 ` [ruby-core:119032] " mame (Yusuke Endoh) via ruby-core
2024-09-06 21:50 ` [ruby-core:119092] " peterzhu2118 (Peter Zhu) via ruby-core
2024-09-09  7:25 ` [ruby-core:119103] " pocke (Masataka Kuwabara) via ruby-core
