* [ruby-core:118346] [Ruby master Bug#20586] Some filesystem calls in dir.c are missing error handling and can return incorrect results if interrupted
@ 2024-06-19 10:10 ivoanjo (Ivo Anjo) via ruby-core
2024-06-19 10:17 ` [ruby-core:118348] " ivoanjo (Ivo Anjo) via ruby-core
` (10 more replies)
0 siblings, 11 replies; 13+ messages in thread
From: ivoanjo (Ivo Anjo) via ruby-core @ 2024-06-19 10:10 UTC (permalink / raw)
To: ruby-core; +Cc: ivoanjo (Ivo Anjo)
Issue #20586 has been reported by ivoanjo (Ivo Anjo).
----------------------------------------
Bug #20586: Some filesystem calls in dir.c are missing error handling and can return incorrect results if interrupted
https://bugs.ruby-lang.org/issues/20586
* Author: ivoanjo (Ivo Anjo)
* Status: Open
* Backport: 3.1: UNKNOWN, 3.2: UNKNOWN, 3.3: UNKNOWN
----------------------------------------
#### Background
Hey! I work for Datadog on the Ruby profiler part of the [`datadog` (previously `ddtrace`)](https://github.com/datadog/dd-trace-rb) gem.
A customer reached [out with an issue](https://github.com/DataDog/dd-trace-rb/issues/3450) where enabling the profiler made `Dir.glob` return no files for a given folder:
Without profiler:
```
irb(main):002:0> Dir.glob('/gcsfuse/t*')
=> ["/gcsfuse/test.html", "/gcsfuse/test.txt"]
```
With profiler:
```
irb(main):002:0> Dir.glob('/gcsfuse/t*')
=> []
```
It turns out the issue is related to missing error handling in `dir.c`.
The Datadog Ruby profiler, like stackprof, pf2 and vernier, uses unix signals to interrupt the currently-active thread and take a sample (usually `SIGPROF`). When some system calls get interrupted by a signal, they return an [EINTR error code](https://man7.org/linux/man-pages/man7/signal.7.html#:~:text=Interruption%20of%20system%20calls%20and%20library%20functions%20by%20signal%20handlers) back to the caller.
Consider for instance the implementation of `dir_each_entry` in `dir.c`:
```c
static VALUE
dir_each_entry(VALUE dir, VALUE (*each)(VALUE, VALUE), VALUE arg, int children_only)
{
struct dir_data *dirp;
struct dirent *dp;
IF_NORMALIZE_UTF8PATH(int norm_p);
GetDIR(dir, dirp);
rewinddir(dirp->dir);
IF_NORMALIZE_UTF8PATH(norm_p = need_normalization(dirp->dir, RSTRING_PTR(dirp->path)));
while ((dp = READDIR(dirp->dir, dirp->enc)) != NULL) {
// ... do things
}
return dir;
}
```
If `READDIR` returns `NULL`, then `dir_each_entry` assumes it has iterated the entire directory. But looking [at the man page for `readdir`](https://man7.org/linux/man-pages/man3/readdir.3.html) we see the following sharp edge (emphasis mine):
> It returns NULL on reaching the end of the directory stream **or if an error occurred**.
So what's happening in this situation is: `readdir` gets interrupted, returns `NULL` + sets errno to `EINTR`. But `dir_each_entry` doesn't check `errno`, so rather than raising an exception to flag the issue, it treats it as if the end of the directory has been reached.
#### How to reproduce
Reproducing this is somewhat annoying, because it's dependent on timing: the signal must arrive at the exact time the dir API is getting executed.
I was able to reproduce this every time by using the [google cloud `gcsfuse`](https://cloud.google.com/storage/docs/gcs-fuse) tool. This somewhat makes sense -- a remote filesystem is much slower than a local one, so there's a much bigger window of opportunity for a signal to arrive while the system call is blocked.
Here's an example I included in https://github.com/DataDog/dd-trace-rb/pull/3720:
```
# Not shown: Set up a trial google cloud account, install gcsfuse, create a cloud storage bucket and put in some test files
$ gcsfuse test_fs_dd_trace_rb fuse-testing/
$ ls fuse-testing/
hello.txt test.html test.txt
# Not shown: Add `datadog` gem to `Gemfile`
$ DD_PROFILING_ENABLED=true DD_PROFILING_DIR_INTERRUPTION_WORKAROUND_ENABLED=false bundle exec ddprofrb exec ruby -e "Datadog::Profiling.wait_until_running; pp Dir.children('fuse-testing/')"
[]
```
Let me know if you'd like me to try to create a reproducer that does not depend on the `datadog` gem.
#### Additional notes
I've spent quite some time looking at the `dir.c` sources, and here's the full list of APIs that suffer from issues:
* `dir_each_entry` does not check `errno`; all of its users have interruption bugs
* `dir_tell` will return -1 instead of the correct position (which means that passing -1 to `dir_seek`/`dir_set_pos` will cause it to not list the directory properly)
* `do_opendir` an error in system calls will only sometimes be turned into a raised exception
* Indirect callers that pass in rb_glob_error as errfunc: rb_glob, Dir.[], Dir.glob
* Indirect callers that pass in 0 as errfunc: ruby_glob, ruby_brace_glob
* `glob_opendir` does not check errno; all of its users have interruption bugs
* `glob_getent` does not check errno; all of its users have interruption bugs
* `nogvl_dir_empty_p` does not check errno (of readdir! it actually checks for opendir); all of its users have interruption bugs
Less sure about these:
* `do_stat`/`do_lstat` will turn errors into warnings (unclear if enabled or disabled by default)
* `need_normalization` calls `fgetattrlist` / `getattrlist` and all errors `(ret != 0)` are treated in the same way
* `rb_glob_error` is and `rb_glob_caller` leave exceptions as pending and rely on callers to raise them properly
* Error handling of `rb_home_dir_of` and `rb_default_home_dir` are a bit suspicious
As a workaround in the Datadog Ruby profiler, in https://github.com/DataDog/dd-trace-rb/pull/3720 I've added monkey patches to all of the Ruby-level APIs that use the above functions and mask out `SIGPROF` so these calls are never interrupted.
This solution is does successfully work around the issue, although it prevents the profiler from sampling during these system calls, which will mean less visibility if e.g. these calls are taking a long time. And well, maintaining monkey patches is always problematic for future Ruby compatibility.
--
https://bugs.ruby-lang.org/
______________________________________________
ruby-core mailing list -- ruby-core@ml.ruby-lang.org
To unsubscribe send an email to ruby-core-leave@ml.ruby-lang.org
ruby-core info -- https://ml.ruby-lang.org/mailman3/postorius/lists/ruby-core.ml.ruby-lang.org/
^ permalink raw reply [flat|nested] 13+ messages in thread
* [ruby-core:118348] [Ruby master Bug#20586] Some filesystem calls in dir.c are missing error handling and can return incorrect results if interrupted
2024-06-19 10:10 [ruby-core:118346] [Ruby master Bug#20586] Some filesystem calls in dir.c are missing error handling and can return incorrect results if interrupted ivoanjo (Ivo Anjo) via ruby-core
@ 2024-06-19 10:17 ` ivoanjo (Ivo Anjo) via ruby-core
2024-06-19 11:11 ` [ruby-core:118350] " mame (Yusuke Endoh) via ruby-core
` (9 subsequent siblings)
10 siblings, 0 replies; 13+ messages in thread
From: ivoanjo (Ivo Anjo) via ruby-core @ 2024-06-19 10:17 UTC (permalink / raw)
To: ruby-core; +Cc: ivoanjo (Ivo Anjo)
Issue #20586 has been updated by ivoanjo (Ivo Anjo).
There's a "related" issue in dir.c which is that it sometimes blocking system calls are performed while Ruby is still holding the GVL, thus blocking the entire VM. I've filed a separate ticket for that -- https://bugs.ruby-lang.org/issues/20587 .
----------------------------------------
Bug #20586: Some filesystem calls in dir.c are missing error handling and can return incorrect results if interrupted
https://bugs.ruby-lang.org/issues/20586#change-108852
* Author: ivoanjo (Ivo Anjo)
* Status: Open
* Backport: 3.1: UNKNOWN, 3.2: UNKNOWN, 3.3: UNKNOWN
----------------------------------------
#### Background
Hey! I work for Datadog on the Ruby profiler part of the [`datadog` (previously `ddtrace`)](https://github.com/datadog/dd-trace-rb) gem.
A customer reached [out with an issue](https://github.com/DataDog/dd-trace-rb/issues/3450) where enabling the profiler made `Dir.glob` return no files for a given folder:
Without profiler:
```
irb(main):002:0> Dir.glob('/gcsfuse/t*')
=> ["/gcsfuse/test.html", "/gcsfuse/test.txt"]
```
With profiler:
```
irb(main):002:0> Dir.glob('/gcsfuse/t*')
=> []
```
It turns out the issue is related to missing error handling in `dir.c`.
The Datadog Ruby profiler, like stackprof, pf2 and vernier, uses unix signals to interrupt the currently-active thread and take a sample (usually `SIGPROF`). When some system calls get interrupted by a signal, they return an [EINTR error code](https://man7.org/linux/man-pages/man7/signal.7.html#:~:text=Interruption%20of%20system%20calls%20and%20library%20functions%20by%20signal%20handlers) back to the caller.
Consider for instance the implementation of `dir_each_entry` in `dir.c`:
```c
static VALUE
dir_each_entry(VALUE dir, VALUE (*each)(VALUE, VALUE), VALUE arg, int children_only)
{
struct dir_data *dirp;
struct dirent *dp;
IF_NORMALIZE_UTF8PATH(int norm_p);
GetDIR(dir, dirp);
rewinddir(dirp->dir);
IF_NORMALIZE_UTF8PATH(norm_p = need_normalization(dirp->dir, RSTRING_PTR(dirp->path)));
while ((dp = READDIR(dirp->dir, dirp->enc)) != NULL) {
// ... do things
}
return dir;
}
```
If `READDIR` returns `NULL`, then `dir_each_entry` assumes it has iterated the entire directory. But looking [at the man page for `readdir`](https://man7.org/linux/man-pages/man3/readdir.3.html) we see the following sharp edge (emphasis mine):
> It returns NULL on reaching the end of the directory stream **or if an error occurred**.
So what's happening in this situation is: `readdir` gets interrupted, returns `NULL` + sets errno to `EINTR`. But `dir_each_entry` doesn't check `errno`, so rather than raising an exception to flag the issue, it treats it as if the end of the directory has been reached.
#### How to reproduce
Reproducing this is somewhat annoying, because it's dependent on timing: the signal must arrive at the exact time the dir API is getting executed.
I was able to reproduce this every time by using the [google cloud `gcsfuse`](https://cloud.google.com/storage/docs/gcs-fuse) tool. This somewhat makes sense -- a remote filesystem is much slower than a local one, so there's a much bigger window of opportunity for a signal to arrive while the system call is blocked.
Here's an example I included in https://github.com/DataDog/dd-trace-rb/pull/3720:
```
# Not shown: Set up a trial google cloud account, install gcsfuse, create a cloud storage bucket and put in some test files
$ gcsfuse test_fs_dd_trace_rb fuse-testing/
$ ls fuse-testing/
hello.txt test.html test.txt
# Not shown: Add `datadog` gem to `Gemfile`
$ DD_PROFILING_ENABLED=true DD_PROFILING_DIR_INTERRUPTION_WORKAROUND_ENABLED=false bundle exec ddprofrb exec ruby -e "Datadog::Profiling.wait_until_running; pp Dir.children('fuse-testing/')"
[]
```
Let me know if you'd like me to try to create a reproducer that does not depend on the `datadog` gem.
#### Additional notes
I've spent quite some time looking at the `dir.c` sources, and here's the full list of APIs that suffer from issues:
* `dir_each_entry` does not check `errno`; all of its users have interruption bugs
* `dir_tell` will return -1 instead of the correct position (which means that passing -1 to `dir_seek`/`dir_set_pos` will cause it to not list the directory properly)
* `do_opendir` an error in system calls will only sometimes be turned into a raised exception
* Indirect callers that pass in rb_glob_error as errfunc: rb_glob, Dir.[], Dir.glob
* Indirect callers that pass in 0 as errfunc: ruby_glob, ruby_brace_glob
* `glob_opendir` does not check errno; all of its users have interruption bugs
* `glob_getent` does not check errno; all of its users have interruption bugs
* `nogvl_dir_empty_p` does not check errno (of readdir! it actually checks for opendir); all of its users have interruption bugs
Less sure about these:
* `do_stat`/`do_lstat` will turn errors into warnings (unclear if enabled or disabled by default)
* `need_normalization` calls `fgetattrlist` / `getattrlist` and all errors `(ret != 0)` are treated in the same way
* `rb_glob_error` is and `rb_glob_caller` leave exceptions as pending and rely on callers to raise them properly
* Error handling of `rb_home_dir_of` and `rb_default_home_dir` are a bit suspicious
As a workaround in the Datadog Ruby profiler, in https://github.com/DataDog/dd-trace-rb/pull/3720 I've added monkey patches to all of the Ruby-level APIs that use the above functions and mask out `SIGPROF` so these calls are never interrupted.
This solution is does successfully work around the issue, although it prevents the profiler from sampling during these system calls, which will mean less visibility if e.g. these calls are taking a long time. And well, maintaining monkey patches is always problematic for future Ruby compatibility.
--
https://bugs.ruby-lang.org/
______________________________________________
ruby-core mailing list -- ruby-core@ml.ruby-lang.org
To unsubscribe send an email to ruby-core-leave@ml.ruby-lang.org
ruby-core info -- https://ml.ruby-lang.org/mailman3/postorius/lists/ruby-core.ml.ruby-lang.org/
^ permalink raw reply [flat|nested] 13+ messages in thread
* [ruby-core:118350] [Ruby master Bug#20586] Some filesystem calls in dir.c are missing error handling and can return incorrect results if interrupted
2024-06-19 10:10 [ruby-core:118346] [Ruby master Bug#20586] Some filesystem calls in dir.c are missing error handling and can return incorrect results if interrupted ivoanjo (Ivo Anjo) via ruby-core
2024-06-19 10:17 ` [ruby-core:118348] " ivoanjo (Ivo Anjo) via ruby-core
@ 2024-06-19 11:11 ` mame (Yusuke Endoh) via ruby-core
2024-06-19 14:04 ` [ruby-core:118354] " ivoanjo (Ivo Anjo) via ruby-core
` (8 subsequent siblings)
10 siblings, 0 replies; 13+ messages in thread
From: mame (Yusuke Endoh) via ruby-core @ 2024-06-19 11:11 UTC (permalink / raw)
To: ruby-core; +Cc: mame (Yusuke Endoh)
Issue #20586 has been updated by mame (Yusuke Endoh).
How would you like to fix it?
IMO, it would be reasonable to have `Dir.glob` raise an exception when `readdir` fails. [The spec of readdir](https://pubs.opengroup.org/onlinepubs/009604599/functions/readdir.html) says:
> Applications wishing to check for error situations should set errno to 0 before calling readdir(). If errno is set to non-zero on return, an error occurred.
Do you want Ruby to automatically retry when readdir returns `EINTR`? If so, I am unsure because neither [the manpage you linked](https://man7.org/linux/man-pages/man3/readdir.3.html) nor [the spec](https://pubs.opengroup.org/onlinepubs/009604599/functions/readdir.html) indicate that readdir may fail with `EINTR`.
Have you actually confirmed that readdir is returning `EINTR`? If so, it could be a bug in the OS. However, if such a bug exists in major platforms, it may be unavoidable for Ruby to handle it.
----------------------------------------
Bug #20586: Some filesystem calls in dir.c are missing error handling and can return incorrect results if interrupted
https://bugs.ruby-lang.org/issues/20586#change-108855
* Author: ivoanjo (Ivo Anjo)
* Status: Open
* Backport: 3.1: UNKNOWN, 3.2: UNKNOWN, 3.3: UNKNOWN
----------------------------------------
#### Background
Hey! I work for Datadog on the Ruby profiler part of the [`datadog` (previously `ddtrace`)](https://github.com/datadog/dd-trace-rb) gem.
A customer reached [out with an issue](https://github.com/DataDog/dd-trace-rb/issues/3450) where enabling the profiler made `Dir.glob` return no files for a given folder:
Without profiler:
```
irb(main):002:0> Dir.glob('/gcsfuse/t*')
=> ["/gcsfuse/test.html", "/gcsfuse/test.txt"]
```
With profiler:
```
irb(main):002:0> Dir.glob('/gcsfuse/t*')
=> []
```
It turns out the issue is related to missing error handling in `dir.c`.
The Datadog Ruby profiler, like stackprof, pf2 and vernier, uses unix signals to interrupt the currently-active thread and take a sample (usually `SIGPROF`). When some system calls get interrupted by a signal, they return an [EINTR error code](https://man7.org/linux/man-pages/man7/signal.7.html#:~:text=Interruption%20of%20system%20calls%20and%20library%20functions%20by%20signal%20handlers) back to the caller.
Consider for instance the implementation of `dir_each_entry` in `dir.c`:
```c
static VALUE
dir_each_entry(VALUE dir, VALUE (*each)(VALUE, VALUE), VALUE arg, int children_only)
{
struct dir_data *dirp;
struct dirent *dp;
IF_NORMALIZE_UTF8PATH(int norm_p);
GetDIR(dir, dirp);
rewinddir(dirp->dir);
IF_NORMALIZE_UTF8PATH(norm_p = need_normalization(dirp->dir, RSTRING_PTR(dirp->path)));
while ((dp = READDIR(dirp->dir, dirp->enc)) != NULL) {
// ... do things
}
return dir;
}
```
If `READDIR` returns `NULL`, then `dir_each_entry` assumes it has iterated the entire directory. But looking [at the man page for `readdir`](https://man7.org/linux/man-pages/man3/readdir.3.html) we see the following sharp edge (emphasis mine):
> It returns NULL on reaching the end of the directory stream **or if an error occurred**.
So what's happening in this situation is: `readdir` gets interrupted, returns `NULL` + sets errno to `EINTR`. But `dir_each_entry` doesn't check `errno`, so rather than raising an exception to flag the issue, it treats it as if the end of the directory has been reached.
#### How to reproduce
Reproducing this is somewhat annoying, because it's dependent on timing: the signal must arrive at the exact time the dir API is getting executed.
I was able to reproduce this every time by using the [google cloud `gcsfuse`](https://cloud.google.com/storage/docs/gcs-fuse) tool. This somewhat makes sense -- a remote filesystem is much slower than a local one, so there's a much bigger window of opportunity for a signal to arrive while the system call is blocked.
Here's an example I included in https://github.com/DataDog/dd-trace-rb/pull/3720:
```
# Not shown: Set up a trial google cloud account, install gcsfuse, create a cloud storage bucket and put in some test files
$ gcsfuse test_fs_dd_trace_rb fuse-testing/
$ ls fuse-testing/
hello.txt test.html test.txt
# Not shown: Add `datadog` gem to `Gemfile`
$ DD_PROFILING_ENABLED=true DD_PROFILING_DIR_INTERRUPTION_WORKAROUND_ENABLED=false bundle exec ddprofrb exec ruby -e "Datadog::Profiling.wait_until_running; pp Dir.children('fuse-testing/')"
[]
```
Let me know if you'd like me to try to create a reproducer that does not depend on the `datadog` gem.
#### Additional notes
I've spent quite some time looking at the `dir.c` sources, and here's the full list of APIs that suffer from issues:
* `dir_each_entry` does not check `errno`; all of its users have interruption bugs
* `dir_tell` will return -1 instead of the correct position (which means that passing -1 to `dir_seek`/`dir_set_pos` will cause it to not list the directory properly)
* `do_opendir` an error in system calls will only sometimes be turned into a raised exception
* Indirect callers that pass in rb_glob_error as errfunc: rb_glob, Dir.[], Dir.glob
* Indirect callers that pass in 0 as errfunc: ruby_glob, ruby_brace_glob
* `glob_opendir` does not check errno; all of its users have interruption bugs
* `glob_getent` does not check errno; all of its users have interruption bugs
* `nogvl_dir_empty_p` does not check errno (of readdir! it actually checks for opendir); all of its users have interruption bugs
Less sure about these:
* `do_stat`/`do_lstat` will turn errors into warnings (unclear if enabled or disabled by default)
* `need_normalization` calls `fgetattrlist` / `getattrlist` and all errors `(ret != 0)` are treated in the same way
* `rb_glob_error` is and `rb_glob_caller` leave exceptions as pending and rely on callers to raise them properly
* Error handling of `rb_home_dir_of` and `rb_default_home_dir` are a bit suspicious
As a workaround in the Datadog Ruby profiler, in https://github.com/DataDog/dd-trace-rb/pull/3720 I've added monkey patches to all of the Ruby-level APIs that use the above functions and mask out `SIGPROF` so these calls are never interrupted.
This solution is does successfully work around the issue, although it prevents the profiler from sampling during these system calls, which will mean less visibility if e.g. these calls are taking a long time. And well, maintaining monkey patches is always problematic for future Ruby compatibility.
--
https://bugs.ruby-lang.org/
______________________________________________
ruby-core mailing list -- ruby-core@ml.ruby-lang.org
To unsubscribe send an email to ruby-core-leave@ml.ruby-lang.org
ruby-core info -- https://ml.ruby-lang.org/mailman3/postorius/lists/ruby-core.ml.ruby-lang.org/
^ permalink raw reply [flat|nested] 13+ messages in thread
* [ruby-core:118354] [Ruby master Bug#20586] Some filesystem calls in dir.c are missing error handling and can return incorrect results if interrupted
2024-06-19 10:10 [ruby-core:118346] [Ruby master Bug#20586] Some filesystem calls in dir.c are missing error handling and can return incorrect results if interrupted ivoanjo (Ivo Anjo) via ruby-core
2024-06-19 10:17 ` [ruby-core:118348] " ivoanjo (Ivo Anjo) via ruby-core
2024-06-19 11:11 ` [ruby-core:118350] " mame (Yusuke Endoh) via ruby-core
@ 2024-06-19 14:04 ` ivoanjo (Ivo Anjo) via ruby-core
2024-06-19 19:57 ` [ruby-core:118358] " Eregon (Benoit Daloze) via ruby-core
` (7 subsequent siblings)
10 siblings, 0 replies; 13+ messages in thread
From: ivoanjo (Ivo Anjo) via ruby-core @ 2024-06-19 14:04 UTC (permalink / raw)
To: ruby-core; +Cc: ivoanjo (Ivo Anjo)
Issue #20586 has been updated by ivoanjo (Ivo Anjo).
File readdir-bug-repro.c added
mame (Yusuke Endoh) wrote in #note-2:
> How would you like to fix it?
>
> IMO, it would be reasonable to have `Dir.glob` raise an exception when `readdir` fails. [The spec of readdir](https://pubs.opengroup.org/onlinepubs/009604599/functions/readdir.html) says:
>
> > Applications wishing to check for error situations should set errno to 0 before calling readdir(). If errno is set to non-zero on return, an error occurred.
>
> Do you want Ruby to automatically retry when readdir returns `EINTR`? If so, I am unsure because neither [the manpage you linked](https://man7.org/linux/man-pages/man3/readdir.3.html) nor [the spec](https://pubs.opengroup.org/onlinepubs/009604599/functions/readdir.html) indicate that readdir may fail with `EINTR`.
I believe the correct thing to do is to raise an exception. This will mean profiling the application could lead to these exceptions showing up where they didn't before, but at least they're an indication that something happened, rather than an incorrect result.
We should avoid retrying, or at least retrying forever, because if some operation always takes 100ms, and the profiler executes every 10ms, then it will keep interrupting the operation and the app will also get stuck.
>
> Have you actually confirmed that readdir is returning `EINTR`?
I have! I have attached a pure-C reproducer, and when I run it with the gcsfuse folder I get:
```
$ gcc readdir-bug-repro.c -o readdir-bug-repro -lpthread && ./readdir-bug-repro fuse-testing/
Set up signal handler!
Received 113 signals, calling readdir...
Failed to read directory 'fuse-testing/': Interrupted system call
```
For reference, I'm running it on:
* Linux rubyshade 6.5.0-35-generic #35~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue May 7 09:00:52 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
* Ubuntu 22.04.4 LTS
* gcc version 11.4.0 (Ubuntu 11.4.0-1ubuntu1~22.04)
> If so, it could be a bug in the OS. However, if such a bug exists in major platforms, it may be unavoidable for Ruby to handle it.
This is a good question. I'll be honest that I could never reproduce with a local filesystem. I'm not sure if this just means that some filesystems (perhaps those implemented with [FUSE](https://www.kernel.org/doc/html/next/filesystems/fuse.html)?) allow this kind of failure, or if it's just really hard to hit it when there's kernel cache and all those things making sure these calls are very speedy.
----------------------------------------
Bug #20586: Some filesystem calls in dir.c are missing error handling and can return incorrect results if interrupted
https://bugs.ruby-lang.org/issues/20586#change-108860
* Author: ivoanjo (Ivo Anjo)
* Status: Open
* Backport: 3.1: UNKNOWN, 3.2: UNKNOWN, 3.3: UNKNOWN
----------------------------------------
#### Background
Hey! I work for Datadog on the Ruby profiler part of the [`datadog` (previously `ddtrace`)](https://github.com/datadog/dd-trace-rb) gem.
A customer reached [out with an issue](https://github.com/DataDog/dd-trace-rb/issues/3450) where enabling the profiler made `Dir.glob` return no files for a given folder:
Without profiler:
```
irb(main):002:0> Dir.glob('/gcsfuse/t*')
=> ["/gcsfuse/test.html", "/gcsfuse/test.txt"]
```
With profiler:
```
irb(main):002:0> Dir.glob('/gcsfuse/t*')
=> []
```
It turns out the issue is related to missing error handling in `dir.c`.
The Datadog Ruby profiler, like stackprof, pf2 and vernier, uses unix signals to interrupt the currently-active thread and take a sample (usually `SIGPROF`). When some system calls get interrupted by a signal, they return an [EINTR error code](https://man7.org/linux/man-pages/man7/signal.7.html#:~:text=Interruption%20of%20system%20calls%20and%20library%20functions%20by%20signal%20handlers) back to the caller.
Consider for instance the implementation of `dir_each_entry` in `dir.c`:
```c
static VALUE
dir_each_entry(VALUE dir, VALUE (*each)(VALUE, VALUE), VALUE arg, int children_only)
{
struct dir_data *dirp;
struct dirent *dp;
IF_NORMALIZE_UTF8PATH(int norm_p);
GetDIR(dir, dirp);
rewinddir(dirp->dir);
IF_NORMALIZE_UTF8PATH(norm_p = need_normalization(dirp->dir, RSTRING_PTR(dirp->path)));
while ((dp = READDIR(dirp->dir, dirp->enc)) != NULL) {
// ... do things
}
return dir;
}
```
If `READDIR` returns `NULL`, then `dir_each_entry` assumes it has iterated the entire directory. But looking [at the man page for `readdir`](https://man7.org/linux/man-pages/man3/readdir.3.html) we see the following sharp edge (emphasis mine):
> It returns NULL on reaching the end of the directory stream **or if an error occurred**.
So what's happening in this situation is: `readdir` gets interrupted, returns `NULL` + sets errno to `EINTR`. But `dir_each_entry` doesn't check `errno`, so rather than raising an exception to flag the issue, it treats it as if the end of the directory has been reached.
#### How to reproduce
Reproducing this is somewhat annoying, because it's dependent on timing: the signal must arrive at the exact time the dir API is getting executed.
I was able to reproduce this every time by using the [google cloud `gcsfuse`](https://cloud.google.com/storage/docs/gcs-fuse) tool. This somewhat makes sense -- a remote filesystem is much slower than a local one, so there's a much bigger window of opportunity for a signal to arrive while the system call is blocked.
Here's an example I included in https://github.com/DataDog/dd-trace-rb/pull/3720:
```
# Not shown: Set up a trial google cloud account, install gcsfuse, create a cloud storage bucket and put in some test files
$ gcsfuse test_fs_dd_trace_rb fuse-testing/
$ ls fuse-testing/
hello.txt test.html test.txt
# Not shown: Add `datadog` gem to `Gemfile`
$ DD_PROFILING_ENABLED=true DD_PROFILING_DIR_INTERRUPTION_WORKAROUND_ENABLED=false bundle exec ddprofrb exec ruby -e "Datadog::Profiling.wait_until_running; pp Dir.children('fuse-testing/')"
[]
```
Let me know if you'd like me to try to create a reproducer that does not depend on the `datadog` gem.
#### Additional notes
I've spent quite some time looking at the `dir.c` sources, and here's the full list of APIs that suffer from issues:
* `dir_each_entry` does not check `errno`; all of its users have interruption bugs
* `dir_tell` will return -1 instead of the correct position (which means that passing -1 to `dir_seek`/`dir_set_pos` will cause it to not list the directory properly)
* `do_opendir` an error in system calls will only sometimes be turned into a raised exception
* Indirect callers that pass in rb_glob_error as errfunc: rb_glob, Dir.[], Dir.glob
* Indirect callers that pass in 0 as errfunc: ruby_glob, ruby_brace_glob
* `glob_opendir` does not check errno; all of its users have interruption bugs
* `glob_getent` does not check errno; all of its users have interruption bugs
* `nogvl_dir_empty_p` does not check errno (of readdir! it actually checks for opendir); all of its users have interruption bugs
Less sure about these:
* `do_stat`/`do_lstat` will turn errors into warnings (unclear if enabled or disabled by default)
* `need_normalization` calls `fgetattrlist` / `getattrlist` and all errors `(ret != 0)` are treated in the same way
* `rb_glob_error` is and `rb_glob_caller` leave exceptions as pending and rely on callers to raise them properly
* Error handling of `rb_home_dir_of` and `rb_default_home_dir` are a bit suspicious
As a workaround in the Datadog Ruby profiler, in https://github.com/DataDog/dd-trace-rb/pull/3720 I've added monkey patches to all of the Ruby-level APIs that use the above functions and mask out `SIGPROF` so these calls are never interrupted.
This solution is does successfully work around the issue, although it prevents the profiler from sampling during these system calls, which will mean less visibility if e.g. these calls are taking a long time. And well, maintaining monkey patches is always problematic for future Ruby compatibility.
---Files--------------------------------
readdir-bug-repro.c (2.11 KB)
--
https://bugs.ruby-lang.org/
______________________________________________
ruby-core mailing list -- ruby-core@ml.ruby-lang.org
To unsubscribe send an email to ruby-core-leave@ml.ruby-lang.org
ruby-core info -- https://ml.ruby-lang.org/mailman3/postorius/lists/ruby-core.ml.ruby-lang.org/
^ permalink raw reply [flat|nested] 13+ messages in thread
* [ruby-core:118358] [Ruby master Bug#20586] Some filesystem calls in dir.c are missing error handling and can return incorrect results if interrupted
2024-06-19 10:10 [ruby-core:118346] [Ruby master Bug#20586] Some filesystem calls in dir.c are missing error handling and can return incorrect results if interrupted ivoanjo (Ivo Anjo) via ruby-core
` (2 preceding siblings ...)
2024-06-19 14:04 ` [ruby-core:118354] " ivoanjo (Ivo Anjo) via ruby-core
@ 2024-06-19 19:57 ` Eregon (Benoit Daloze) via ruby-core
2024-06-20 5:02 ` [ruby-core:118359] " mame (Yusuke Endoh) via ruby-core
` (6 subsequent siblings)
10 siblings, 0 replies; 13+ messages in thread
From: Eregon (Benoit Daloze) via ruby-core @ 2024-06-19 19:57 UTC (permalink / raw)
To: ruby-core; +Cc: Eregon (Benoit Daloze)
Issue #20586 has been updated by Eregon (Benoit Daloze).
If it returns EINTR, then we should retry, because that's already the behavior for all other blocking syscalls like `read(2)`, etc.
Is the SIGPROF handler installed with [sigaction()](https://man7.org/linux/man-pages/man2/sigaction.2.html) without the `SA_RESTART` flag?
Does it still happen if the `SA_RESTART` is passed to `sigaction()`? If yes it sounds like a possible bug of gcsfuse.
I suppose it can make sense the SIGPROF handler does not pass `SA_RESTART` to be able to profile during blocking syscalls.
But depending on the method of profiling it might be unnecessary to interrupt syscalls, in which case it would be better to pass `SA_RESTART`.
But in any case we should retry in CRuby, because for instance SIGVTALRM, used internally, on purpose interrupts syscalls.
----------------------------------------
Bug #20586: Some filesystem calls in dir.c are missing error handling and can return incorrect results if interrupted
https://bugs.ruby-lang.org/issues/20586#change-108865
* Author: ivoanjo (Ivo Anjo)
* Status: Open
* Backport: 3.1: UNKNOWN, 3.2: UNKNOWN, 3.3: UNKNOWN
----------------------------------------
#### Background
Hey! I work for Datadog on the Ruby profiler part of the [`datadog` (previously `ddtrace`)](https://github.com/datadog/dd-trace-rb) gem.
A customer reached [out with an issue](https://github.com/DataDog/dd-trace-rb/issues/3450) where enabling the profiler made `Dir.glob` return no files for a given folder:
Without profiler:
```
irb(main):002:0> Dir.glob('/gcsfuse/t*')
=> ["/gcsfuse/test.html", "/gcsfuse/test.txt"]
```
With profiler:
```
irb(main):002:0> Dir.glob('/gcsfuse/t*')
=> []
```
It turns out the issue is related to missing error handling in `dir.c`.
The Datadog Ruby profiler, like stackprof, pf2 and vernier, uses unix signals to interrupt the currently-active thread and take a sample (usually `SIGPROF`). When some system calls get interrupted by a signal, they return an [EINTR error code](https://man7.org/linux/man-pages/man7/signal.7.html#:~:text=Interruption%20of%20system%20calls%20and%20library%20functions%20by%20signal%20handlers) back to the caller.
Consider for instance the implementation of `dir_each_entry` in `dir.c`:
```c
static VALUE
dir_each_entry(VALUE dir, VALUE (*each)(VALUE, VALUE), VALUE arg, int children_only)
{
struct dir_data *dirp;
struct dirent *dp;
IF_NORMALIZE_UTF8PATH(int norm_p);
GetDIR(dir, dirp);
rewinddir(dirp->dir);
IF_NORMALIZE_UTF8PATH(norm_p = need_normalization(dirp->dir, RSTRING_PTR(dirp->path)));
while ((dp = READDIR(dirp->dir, dirp->enc)) != NULL) {
// ... do things
}
return dir;
}
```
If `READDIR` returns `NULL`, then `dir_each_entry` assumes it has iterated the entire directory. But looking [at the man page for `readdir`](https://man7.org/linux/man-pages/man3/readdir.3.html) we see the following sharp edge (emphasis mine):
> It returns NULL on reaching the end of the directory stream **or if an error occurred**.
So what's happening in this situation is: `readdir` gets interrupted, returns `NULL` + sets errno to `EINTR`. But `dir_each_entry` doesn't check `errno`, so rather than raising an exception to flag the issue, it treats it as if the end of the directory has been reached.
#### How to reproduce
Reproducing this is somewhat annoying, because it's dependent on timing: the signal must arrive at the exact time the dir API is getting executed.
I was able to reproduce this every time by using the [google cloud `gcsfuse`](https://cloud.google.com/storage/docs/gcs-fuse) tool. This somewhat makes sense -- a remote filesystem is much slower than a local one, so there's a much bigger window of opportunity for a signal to arrive while the system call is blocked.
Here's an example I included in https://github.com/DataDog/dd-trace-rb/pull/3720:
```
# Not shown: Set up a trial google cloud account, install gcsfuse, create a cloud storage bucket and put in some test files
$ gcsfuse test_fs_dd_trace_rb fuse-testing/
$ ls fuse-testing/
hello.txt test.html test.txt
# Not shown: Add `datadog` gem to `Gemfile`
$ DD_PROFILING_ENABLED=true DD_PROFILING_DIR_INTERRUPTION_WORKAROUND_ENABLED=false bundle exec ddprofrb exec ruby -e "Datadog::Profiling.wait_until_running; pp Dir.children('fuse-testing/')"
[]
```
Let me know if you'd like me to try to create a reproducer that does not depend on the `datadog` gem.
#### Additional notes
I've spent quite some time looking at the `dir.c` sources, and here's the full list of APIs that suffer from issues:
* `dir_each_entry` does not check `errno`; all of its users have interruption bugs
* `dir_tell` will return -1 instead of the correct position (which means that passing -1 to `dir_seek`/`dir_set_pos` will cause it to not list the directory properly)
* `do_opendir` an error in system calls will only sometimes be turned into a raised exception
* Indirect callers that pass in rb_glob_error as errfunc: rb_glob, Dir.[], Dir.glob
* Indirect callers that pass in 0 as errfunc: ruby_glob, ruby_brace_glob
* `glob_opendir` does not check errno; all of its users have interruption bugs
* `glob_getent` does not check errno; all of its users have interruption bugs
* `nogvl_dir_empty_p` does not check errno (of readdir! it actually checks for opendir); all of its users have interruption bugs
Less sure about these:
* `do_stat`/`do_lstat` will turn errors into warnings (unclear if enabled or disabled by default)
* `need_normalization` calls `fgetattrlist` / `getattrlist` and all errors `(ret != 0)` are treated in the same way
* `rb_glob_error` is and `rb_glob_caller` leave exceptions as pending and rely on callers to raise them properly
* Error handling of `rb_home_dir_of` and `rb_default_home_dir` are a bit suspicious
As a workaround in the Datadog Ruby profiler, in https://github.com/DataDog/dd-trace-rb/pull/3720 I've added monkey patches to all of the Ruby-level APIs that use the above functions and mask out `SIGPROF` so these calls are never interrupted.
This solution is does successfully work around the issue, although it prevents the profiler from sampling during these system calls, which will mean less visibility if e.g. these calls are taking a long time. And well, maintaining monkey patches is always problematic for future Ruby compatibility.
---Files--------------------------------
readdir-bug-repro.c (2.11 KB)
--
https://bugs.ruby-lang.org/
______________________________________________
ruby-core mailing list -- ruby-core@ml.ruby-lang.org
To unsubscribe send an email to ruby-core-leave@ml.ruby-lang.org
ruby-core info -- https://ml.ruby-lang.org/mailman3/postorius/lists/ruby-core.ml.ruby-lang.org/
^ permalink raw reply [flat|nested] 13+ messages in thread
* [ruby-core:118359] [Ruby master Bug#20586] Some filesystem calls in dir.c are missing error handling and can return incorrect results if interrupted
2024-06-19 10:10 [ruby-core:118346] [Ruby master Bug#20586] Some filesystem calls in dir.c are missing error handling and can return incorrect results if interrupted ivoanjo (Ivo Anjo) via ruby-core
` (3 preceding siblings ...)
2024-06-19 19:57 ` [ruby-core:118358] " Eregon (Benoit Daloze) via ruby-core
@ 2024-06-20 5:02 ` mame (Yusuke Endoh) via ruby-core
2024-06-20 9:53 ` [ruby-core:118364] " ivoanjo (Ivo Anjo) via ruby-core
` (5 subsequent siblings)
10 siblings, 0 replies; 13+ messages in thread
From: mame (Yusuke Endoh) via ruby-core @ 2024-06-20 5:02 UTC (permalink / raw)
To: ruby-core; +Cc: mame (Yusuke Endoh)
Issue #20586 has been updated by mame (Yusuke Endoh).
I don't think it's a good idea to retry `readdir` blindly.
It is probably that the FUSE driver throws EINTR when communicating, but it is not clear what kind of processing is done before and after the communication. It is possible that the internal state is inconsistent after readdir returns EINTR.
Then, should we rewinddir and try again? Or do we need to closedir and opendir again?
I don't think this is a problem we should deal with in our imagination. If we are serious about this problem, we should start by reporting the problem to the Linux kernel and asking them to clarify the behavior on their manpage.
As for Ruby, I think it is good to raise an exception by `rb_sys_fail` for the time being.
And as for SIGPROF, it would be easy to set SA_RESTART. In fact, venier seems to set it.
https://github.com/jhawthorn/vernier/blob/beb8ca2561c7924b0d7e8b4759681835cf21df80/ext/vernier/vernier.cc#L1606
----------------------------------------
Bug #20586: Some filesystem calls in dir.c are missing error handling and can return incorrect results if interrupted
https://bugs.ruby-lang.org/issues/20586#change-108866
* Author: ivoanjo (Ivo Anjo)
* Status: Open
* Backport: 3.1: UNKNOWN, 3.2: UNKNOWN, 3.3: UNKNOWN
----------------------------------------
#### Background
Hey! I work for Datadog on the Ruby profiler part of the [`datadog` (previously `ddtrace`)](https://github.com/datadog/dd-trace-rb) gem.
A customer reached [out with an issue](https://github.com/DataDog/dd-trace-rb/issues/3450) where enabling the profiler made `Dir.glob` return no files for a given folder:
Without profiler:
```
irb(main):002:0> Dir.glob('/gcsfuse/t*')
=> ["/gcsfuse/test.html", "/gcsfuse/test.txt"]
```
With profiler:
```
irb(main):002:0> Dir.glob('/gcsfuse/t*')
=> []
```
It turns out the issue is related to missing error handling in `dir.c`.
The Datadog Ruby profiler, like stackprof, pf2 and vernier, uses unix signals to interrupt the currently-active thread and take a sample (usually `SIGPROF`). When some system calls get interrupted by a signal, they return an [EINTR error code](https://man7.org/linux/man-pages/man7/signal.7.html#:~:text=Interruption%20of%20system%20calls%20and%20library%20functions%20by%20signal%20handlers) back to the caller.
Consider for instance the implementation of `dir_each_entry` in `dir.c`:
```c
static VALUE
dir_each_entry(VALUE dir, VALUE (*each)(VALUE, VALUE), VALUE arg, int children_only)
{
struct dir_data *dirp;
struct dirent *dp;
IF_NORMALIZE_UTF8PATH(int norm_p);
GetDIR(dir, dirp);
rewinddir(dirp->dir);
IF_NORMALIZE_UTF8PATH(norm_p = need_normalization(dirp->dir, RSTRING_PTR(dirp->path)));
while ((dp = READDIR(dirp->dir, dirp->enc)) != NULL) {
// ... do things
}
return dir;
}
```
If `READDIR` returns `NULL`, then `dir_each_entry` assumes it has iterated the entire directory. But looking [at the man page for `readdir`](https://man7.org/linux/man-pages/man3/readdir.3.html) we see the following sharp edge (emphasis mine):
> It returns NULL on reaching the end of the directory stream **or if an error occurred**.
So what's happening in this situation is: `readdir` gets interrupted, returns `NULL` + sets errno to `EINTR`. But `dir_each_entry` doesn't check `errno`, so rather than raising an exception to flag the issue, it treats it as if the end of the directory has been reached.
#### How to reproduce
Reproducing this is somewhat annoying, because it's dependent on timing: the signal must arrive at the exact time the dir API is getting executed.
I was able to reproduce this every time by using the [google cloud `gcsfuse`](https://cloud.google.com/storage/docs/gcs-fuse) tool. This somewhat makes sense -- a remote filesystem is much slower than a local one, so there's a much bigger window of opportunity for a signal to arrive while the system call is blocked.
Here's an example I included in https://github.com/DataDog/dd-trace-rb/pull/3720:
```
# Not shown: Set up a trial google cloud account, install gcsfuse, create a cloud storage bucket and put in some test files
$ gcsfuse test_fs_dd_trace_rb fuse-testing/
$ ls fuse-testing/
hello.txt test.html test.txt
# Not shown: Add `datadog` gem to `Gemfile`
$ DD_PROFILING_ENABLED=true DD_PROFILING_DIR_INTERRUPTION_WORKAROUND_ENABLED=false bundle exec ddprofrb exec ruby -e "Datadog::Profiling.wait_until_running; pp Dir.children('fuse-testing/')"
[]
```
Let me know if you'd like me to try to create a reproducer that does not depend on the `datadog` gem.
#### Additional notes
I've spent quite some time looking at the `dir.c` sources, and here's the full list of APIs that suffer from issues:
* `dir_each_entry` does not check `errno`; all of its users have interruption bugs
* `dir_tell` will return -1 instead of the correct position (which means that passing -1 to `dir_seek`/`dir_set_pos` will cause it to not list the directory properly)
* `do_opendir` an error in system calls will only sometimes be turned into a raised exception
* Indirect callers that pass in rb_glob_error as errfunc: rb_glob, Dir.[], Dir.glob
* Indirect callers that pass in 0 as errfunc: ruby_glob, ruby_brace_glob
* `glob_opendir` does not check errno; all of its users have interruption bugs
* `glob_getent` does not check errno; all of its users have interruption bugs
* `nogvl_dir_empty_p` does not check errno (of readdir! it actually checks for opendir); all of its users have interruption bugs
Less sure about these:
* `do_stat`/`do_lstat` will turn errors into warnings (unclear if enabled or disabled by default)
* `need_normalization` calls `fgetattrlist` / `getattrlist` and all errors `(ret != 0)` are treated in the same way
* `rb_glob_error` is and `rb_glob_caller` leave exceptions as pending and rely on callers to raise them properly
* Error handling of `rb_home_dir_of` and `rb_default_home_dir` are a bit suspicious
As a workaround in the Datadog Ruby profiler, in https://github.com/DataDog/dd-trace-rb/pull/3720 I've added monkey patches to all of the Ruby-level APIs that use the above functions and mask out `SIGPROF` so these calls are never interrupted.
This solution is does successfully work around the issue, although it prevents the profiler from sampling during these system calls, which will mean less visibility if e.g. these calls are taking a long time. And well, maintaining monkey patches is always problematic for future Ruby compatibility.
---Files--------------------------------
readdir-bug-repro.c (2.11 KB)
--
https://bugs.ruby-lang.org/
______________________________________________
ruby-core mailing list -- ruby-core@ml.ruby-lang.org
To unsubscribe send an email to ruby-core-leave@ml.ruby-lang.org
ruby-core info -- https://ml.ruby-lang.org/mailman3/postorius/lists/ruby-core.ml.ruby-lang.org/
^ permalink raw reply [flat|nested] 13+ messages in thread
* [ruby-core:118364] [Ruby master Bug#20586] Some filesystem calls in dir.c are missing error handling and can return incorrect results if interrupted
2024-06-19 10:10 [ruby-core:118346] [Ruby master Bug#20586] Some filesystem calls in dir.c are missing error handling and can return incorrect results if interrupted ivoanjo (Ivo Anjo) via ruby-core
` (4 preceding siblings ...)
2024-06-20 5:02 ` [ruby-core:118359] " mame (Yusuke Endoh) via ruby-core
@ 2024-06-20 9:53 ` ivoanjo (Ivo Anjo) via ruby-core
2024-08-16 21:59 ` [ruby-core:118861] " jeremyevans0 (Jeremy Evans) via ruby-core
` (4 subsequent siblings)
10 siblings, 0 replies; 13+ messages in thread
From: ivoanjo (Ivo Anjo) via ruby-core @ 2024-06-20 9:53 UTC (permalink / raw)
To: ruby-core; +Cc: ivoanjo (Ivo Anjo)
Issue #20586 has been updated by ivoanjo (Ivo Anjo).
Eregon (Benoit Daloze) wrote in #note-4:
> Is the SIGPROF handler installed with [sigaction()](https://man7.org/linux/man-pages/man2/sigaction.2.html) without the `SA_RESTART` flag?
> Does it still happen if the `SA_RESTART` is passed to `sigaction()`? If yes it sounds like a possible bug of gcsfuse.
Unfortunately the semantics aren't as simple as that :/
Both the reproducer I shared above as well [`readdir-bug-repro.c`](https://bugs.ruby-lang.org/attachments/9751) as well as [the Datadog Ruby profiler](https://github.com/DataDog/dd-trace-rb/blob/4b1901101c0f4d1033558bd9c0ff2e1c2b963bc7/ext/datadog_profiling_native_extension/setup_signal_handler.c#L35) set `SA_RESTART` when adding the signal handler.
The [man page](https://man7.org/linux/man-pages/man7/signal.7.html#:~:text=Interruption%20of%20system%20calls%20and%20library%20functions%20by%20signal%20handlers) has a section mentioning some system calls are not restarted, regardless of the flag.
> The following interfaces are never restarted after being
> interrupted by a signal handler, regardless of the use of
> SA_RESTART; they always fail with the error EINTR when
> interrupted by a signal handler:
> (...)
`readdir` is not mentioned in this section, but it's also not mentioned in the section about calls that are restarted. So it's in a weird limbo.
mame (Yusuke Endoh) wrote in #note-5:
> It is probably that the FUSE driver throws EINTR when communicating, but it is not clear what kind of processing is done before and after the communication. It is possible that the internal state is inconsistent after readdir returns EINTR.
>
> Then, should we rewinddir and try again? Or do we need to closedir and opendir again?
>
> I don't think this is a problem we should deal with in our imagination. If we are serious about this problem, we should start by reporting the problem to the Linux kernel and asking them to clarify the behavior on their manpage.
Yeah -- I suspect the kernel/FUSE should hide a lot of these, otherwise most userland apps will get it wrong, but on the docs I could find it's not clear this behavior is entirely expected.
> As for Ruby, I think it is good to raise an exception by `rb_sys_fail` for the time being.
This is my thinking as well -- if this does turn out to be a linux/FUSE/etc bug, it'll probably take a long time to roll out the fix so having Ruby do the right thing in the face of the weird issue is probably worth it?
----------------------------------------
Bug #20586: Some filesystem calls in dir.c are missing error handling and can return incorrect results if interrupted
https://bugs.ruby-lang.org/issues/20586#change-108870
* Author: ivoanjo (Ivo Anjo)
* Status: Open
* Backport: 3.1: UNKNOWN, 3.2: UNKNOWN, 3.3: UNKNOWN
----------------------------------------
#### Background
Hey! I work for Datadog on the Ruby profiler part of the [`datadog` (previously `ddtrace`)](https://github.com/datadog/dd-trace-rb) gem.
A customer reached [out with an issue](https://github.com/DataDog/dd-trace-rb/issues/3450) where enabling the profiler made `Dir.glob` return no files for a given folder:
Without profiler:
```
irb(main):002:0> Dir.glob('/gcsfuse/t*')
=> ["/gcsfuse/test.html", "/gcsfuse/test.txt"]
```
With profiler:
```
irb(main):002:0> Dir.glob('/gcsfuse/t*')
=> []
```
It turns out the issue is related to missing error handling in `dir.c`.
The Datadog Ruby profiler, like stackprof, pf2 and vernier, uses unix signals to interrupt the currently-active thread and take a sample (usually `SIGPROF`). When some system calls get interrupted by a signal, they return an [EINTR error code](https://man7.org/linux/man-pages/man7/signal.7.html#:~:text=Interruption%20of%20system%20calls%20and%20library%20functions%20by%20signal%20handlers) back to the caller.
Consider for instance the implementation of `dir_each_entry` in `dir.c`:
```c
static VALUE
dir_each_entry(VALUE dir, VALUE (*each)(VALUE, VALUE), VALUE arg, int children_only)
{
struct dir_data *dirp;
struct dirent *dp;
IF_NORMALIZE_UTF8PATH(int norm_p);
GetDIR(dir, dirp);
rewinddir(dirp->dir);
IF_NORMALIZE_UTF8PATH(norm_p = need_normalization(dirp->dir, RSTRING_PTR(dirp->path)));
while ((dp = READDIR(dirp->dir, dirp->enc)) != NULL) {
// ... do things
}
return dir;
}
```
If `READDIR` returns `NULL`, then `dir_each_entry` assumes it has iterated the entire directory. But looking [at the man page for `readdir`](https://man7.org/linux/man-pages/man3/readdir.3.html) we see the following sharp edge (emphasis mine):
> It returns NULL on reaching the end of the directory stream **or if an error occurred**.
So what's happening in this situation is: `readdir` gets interrupted, returns `NULL` + sets errno to `EINTR`. But `dir_each_entry` doesn't check `errno`, so rather than raising an exception to flag the issue, it treats it as if the end of the directory has been reached.
#### How to reproduce
Reproducing this is somewhat annoying, because it's dependent on timing: the signal must arrive at the exact time the dir API is getting executed.
I was able to reproduce this every time by using the [google cloud `gcsfuse`](https://cloud.google.com/storage/docs/gcs-fuse) tool. This somewhat makes sense -- a remote filesystem is much slower than a local one, so there's a much bigger window of opportunity for a signal to arrive while the system call is blocked.
Here's an example I included in https://github.com/DataDog/dd-trace-rb/pull/3720:
```
# Not shown: Set up a trial google cloud account, install gcsfuse, create a cloud storage bucket and put in some test files
$ gcsfuse test_fs_dd_trace_rb fuse-testing/
$ ls fuse-testing/
hello.txt test.html test.txt
# Not shown: Add `datadog` gem to `Gemfile`
$ DD_PROFILING_ENABLED=true DD_PROFILING_DIR_INTERRUPTION_WORKAROUND_ENABLED=false bundle exec ddprofrb exec ruby -e "Datadog::Profiling.wait_until_running; pp Dir.children('fuse-testing/')"
[]
```
Let me know if you'd like me to try to create a reproducer that does not depend on the `datadog` gem.
#### Additional notes
I've spent quite some time looking at the `dir.c` sources, and here's the full list of APIs that suffer from issues:
* `dir_each_entry` does not check `errno`; all of its users have interruption bugs
* `dir_tell` will return -1 instead of the correct position (which means that passing -1 to `dir_seek`/`dir_set_pos` will cause it to not list the directory properly)
* `do_opendir` an error in system calls will only sometimes be turned into a raised exception
* Indirect callers that pass in rb_glob_error as errfunc: rb_glob, Dir.[], Dir.glob
* Indirect callers that pass in 0 as errfunc: ruby_glob, ruby_brace_glob
* `glob_opendir` does not check errno; all of its users have interruption bugs
* `glob_getent` does not check errno; all of its users have interruption bugs
* `nogvl_dir_empty_p` does not check errno (of readdir! it actually checks for opendir); all of its users have interruption bugs
Less sure about these:
* `do_stat`/`do_lstat` will turn errors into warnings (unclear if enabled or disabled by default)
* `need_normalization` calls `fgetattrlist` / `getattrlist` and all errors `(ret != 0)` are treated in the same way
* `rb_glob_error` is and `rb_glob_caller` leave exceptions as pending and rely on callers to raise them properly
* Error handling of `rb_home_dir_of` and `rb_default_home_dir` are a bit suspicious
As a workaround in the Datadog Ruby profiler, in https://github.com/DataDog/dd-trace-rb/pull/3720 I've added monkey patches to all of the Ruby-level APIs that use the above functions and mask out `SIGPROF` so these calls are never interrupted.
This solution is does successfully work around the issue, although it prevents the profiler from sampling during these system calls, which will mean less visibility if e.g. these calls are taking a long time. And well, maintaining monkey patches is always problematic for future Ruby compatibility.
---Files--------------------------------
readdir-bug-repro.c (2.11 KB)
--
https://bugs.ruby-lang.org/
______________________________________________
ruby-core mailing list -- ruby-core@ml.ruby-lang.org
To unsubscribe send an email to ruby-core-leave@ml.ruby-lang.org
ruby-core info -- https://ml.ruby-lang.org/mailman3/postorius/lists/ruby-core.ml.ruby-lang.org/
^ permalink raw reply [flat|nested] 13+ messages in thread
* [ruby-core:118861] [Ruby master Bug#20586] Some filesystem calls in dir.c are missing error handling and can return incorrect results if interrupted
2024-06-19 10:10 [ruby-core:118346] [Ruby master Bug#20586] Some filesystem calls in dir.c are missing error handling and can return incorrect results if interrupted ivoanjo (Ivo Anjo) via ruby-core
` (5 preceding siblings ...)
2024-06-20 9:53 ` [ruby-core:118364] " ivoanjo (Ivo Anjo) via ruby-core
@ 2024-08-16 21:59 ` jeremyevans0 (Jeremy Evans) via ruby-core
2024-08-18 3:46 ` [ruby-core:118874] " Alan D. Salewski via ruby-core
2024-08-19 9:19 ` [ruby-core:118890] " ivoanjo (Ivo Anjo) via ruby-core
` (3 subsequent siblings)
10 siblings, 1 reply; 13+ messages in thread
From: jeremyevans0 (Jeremy Evans) via ruby-core @ 2024-08-16 21:59 UTC (permalink / raw)
To: ruby-core; +Cc: jeremyevans0 (Jeremy Evans)
Issue #20586 has been updated by jeremyevans0 (Jeremy Evans).
I've submitted a pull request that I hope will address most of these issues: https://github.com/ruby/ruby/pull/11393
It adds error checking for `readdir`, `telldir`, and `closedir` calls in `dir.c`.
The `closedir` check caught an actual bug exposed in Ruby's test suite, where a legitimate `EBADF` error returned by `closedir` would be ignored.
ivoanjo (Ivo Anjo) wrote:
> I've spent quite some time looking at the `dir.c` sources, and here's the full list of APIs that suffer from issues:
>
> * `dir_each_entry` does not check `errno`; all of its users have interruption bugs
Fixed by adding `readdir` error checking.
> * `dir_tell` will return -1 instead of the correct position (which means that passing -1 to `dir_seek`/`dir_set_pos` will cause it to not list the directory properly)
Fixed by adding `telldir` error checking.
> * `do_opendir` an error in system calls will only sometimes be turned into a raised exception
> * Indirect callers that pass in rb_glob_error as errfunc: rb_glob, Dir.[], Dir.glob
> * Indirect callers that pass in 0 as errfunc: ruby_glob, ruby_brace_glob
This appears to be by design, `ruby_glob` documentation states: "Identical to rb_glob(), except it returns opaque exception states instead of raising exceptions.", and `ruby_brace_glob` documentation states "Identical to ruby_glob()". Neither `ruby_glob` nor `ruby_brace_glob` are used internally, they are only for use by extensions.
> * `glob_opendir` does not check errno; all of its users have interruption bugs
> * `glob_getent` does not check errno; all of its users have interruption bugs
> * `nogvl_dir_empty_p` does not check errno (of readdir! it actually checks for opendir); all of its users have interruption bugs
Fixed by adding `readdir` error checking.
> Less sure about these:
>
> * `do_stat`/`do_lstat` will turn errors into warnings (unclear if enabled or disabled by default)
Use of `sys_warning` happens in many other places for similar reasons (also in `do_opendir`), so I don't think we should make changes just for those functions. These are verbose-mode only warnings.
> * `need_normalization` calls `fgetattrlist` / `getattrlist` and all errors `(ret != 0)` are treated in the same way
If an error occurs, that is currently treated as not needing normalization, which seems reasonable. One of the errors it can raise is ENOTSUP, indicating getattrlist is not supported, which can be taken as not needing normalization, since I think volumes that require normalization have support for getattrlist. Even if other cases, if there is a problem with the operation, it will occur later when the non-normalized file path is used.
> * `rb_glob_error` is and `rb_glob_caller` leave exceptions as pending and rely on callers to raise them properly
I doubt errors can be raised properly the way the functions are designed. The use of `rb_protect` means you only know whether there was an exception raised, not anything about the exception.
> * Error handling of `rb_home_dir_of` and `rb_default_home_dir` are a bit suspicious
`rb_home_dir_of` I have switching to use `rb_getpwdirnam_for_login` in pull request 11202, and that function has proper error checks. `rb_default_home_dir` calls `rb_getpwdirnam_for_login`, `rb_getlogin`, and `rb_getpwdiruid`. Can you let me know what part of `rb_default_home_dir` still looks suspicious?
----------------------------------------
Bug #20586: Some filesystem calls in dir.c are missing error handling and can return incorrect results if interrupted
https://bugs.ruby-lang.org/issues/20586#change-109436
* Author: ivoanjo (Ivo Anjo)
* Status: Open
* Backport: 3.1: UNKNOWN, 3.2: UNKNOWN, 3.3: UNKNOWN
----------------------------------------
#### Background
Hey! I work for Datadog on the Ruby profiler part of the [`datadog` (previously `ddtrace`)](https://github.com/datadog/dd-trace-rb) gem.
A customer reached [out with an issue](https://github.com/DataDog/dd-trace-rb/issues/3450) where enabling the profiler made `Dir.glob` return no files for a given folder:
Without profiler:
```
irb(main):002:0> Dir.glob('/gcsfuse/t*')
=> ["/gcsfuse/test.html", "/gcsfuse/test.txt"]
```
With profiler:
```
irb(main):002:0> Dir.glob('/gcsfuse/t*')
=> []
```
It turns out the issue is related to missing error handling in `dir.c`.
The Datadog Ruby profiler, like stackprof, pf2 and vernier, uses unix signals to interrupt the currently-active thread and take a sample (usually `SIGPROF`). When some system calls get interrupted by a signal, they return an [EINTR error code](https://man7.org/linux/man-pages/man7/signal.7.html#:~:text=Interruption%20of%20system%20calls%20and%20library%20functions%20by%20signal%20handlers) back to the caller.
Consider for instance the implementation of `dir_each_entry` in `dir.c`:
```c
static VALUE
dir_each_entry(VALUE dir, VALUE (*each)(VALUE, VALUE), VALUE arg, int children_only)
{
struct dir_data *dirp;
struct dirent *dp;
IF_NORMALIZE_UTF8PATH(int norm_p);
GetDIR(dir, dirp);
rewinddir(dirp->dir);
IF_NORMALIZE_UTF8PATH(norm_p = need_normalization(dirp->dir, RSTRING_PTR(dirp->path)));
while ((dp = READDIR(dirp->dir, dirp->enc)) != NULL) {
// ... do things
}
return dir;
}
```
If `READDIR` returns `NULL`, then `dir_each_entry` assumes it has iterated the entire directory. But looking [at the man page for `readdir`](https://man7.org/linux/man-pages/man3/readdir.3.html) we see the following sharp edge (emphasis mine):
> It returns NULL on reaching the end of the directory stream **or if an error occurred**.
So what's happening in this situation is: `readdir` gets interrupted, returns `NULL` + sets errno to `EINTR`. But `dir_each_entry` doesn't check `errno`, so rather than raising an exception to flag the issue, it treats it as if the end of the directory has been reached.
#### How to reproduce
Reproducing this is somewhat annoying, because it's dependent on timing: the signal must arrive at the exact time the dir API is getting executed.
I was able to reproduce this every time by using the [google cloud `gcsfuse`](https://cloud.google.com/storage/docs/gcs-fuse) tool. This somewhat makes sense -- a remote filesystem is much slower than a local one, so there's a much bigger window of opportunity for a signal to arrive while the system call is blocked.
Here's an example I included in https://github.com/DataDog/dd-trace-rb/pull/3720:
```
# Not shown: Set up a trial google cloud account, install gcsfuse, create a cloud storage bucket and put in some test files
$ gcsfuse test_fs_dd_trace_rb fuse-testing/
$ ls fuse-testing/
hello.txt test.html test.txt
# Not shown: Add `datadog` gem to `Gemfile`
$ DD_PROFILING_ENABLED=true DD_PROFILING_DIR_INTERRUPTION_WORKAROUND_ENABLED=false bundle exec ddprofrb exec ruby -e "Datadog::Profiling.wait_until_running; pp Dir.children('fuse-testing/')"
[]
```
Let me know if you'd like me to try to create a reproducer that does not depend on the `datadog` gem.
#### Additional notes
I've spent quite some time looking at the `dir.c` sources, and here's the full list of APIs that suffer from issues:
* `dir_each_entry` does not check `errno`; all of its users have interruption bugs
* `dir_tell` will return -1 instead of the correct position (which means that passing -1 to `dir_seek`/`dir_set_pos` will cause it to not list the directory properly)
* `do_opendir` an error in system calls will only sometimes be turned into a raised exception
* Indirect callers that pass in rb_glob_error as errfunc: rb_glob, Dir.[], Dir.glob
* Indirect callers that pass in 0 as errfunc: ruby_glob, ruby_brace_glob
* `glob_opendir` does not check errno; all of its users have interruption bugs
* `glob_getent` does not check errno; all of its users have interruption bugs
* `nogvl_dir_empty_p` does not check errno (of readdir! it actually checks for opendir); all of its users have interruption bugs
Less sure about these:
* `do_stat`/`do_lstat` will turn errors into warnings (unclear if enabled or disabled by default)
* `need_normalization` calls `fgetattrlist` / `getattrlist` and all errors `(ret != 0)` are treated in the same way
* `rb_glob_error` is and `rb_glob_caller` leave exceptions as pending and rely on callers to raise them properly
* Error handling of `rb_home_dir_of` and `rb_default_home_dir` are a bit suspicious
As a workaround in the Datadog Ruby profiler, in https://github.com/DataDog/dd-trace-rb/pull/3720 I've added monkey patches to all of the Ruby-level APIs that use the above functions and mask out `SIGPROF` so these calls are never interrupted.
This solution is does successfully work around the issue, although it prevents the profiler from sampling during these system calls, which will mean less visibility if e.g. these calls are taking a long time. And well, maintaining monkey patches is always problematic for future Ruby compatibility.
---Files--------------------------------
readdir-bug-repro.c (2.11 KB)
--
https://bugs.ruby-lang.org/
______________________________________________
ruby-core mailing list -- ruby-core@ml.ruby-lang.org
To unsubscribe send an email to ruby-core-leave@ml.ruby-lang.org
ruby-core info -- https://ml.ruby-lang.org/mailman3/lists/ruby-core.ml.ruby-lang.org/
^ permalink raw reply [flat|nested] 13+ messages in thread
* [ruby-core:118874] Re: [Ruby master Bug#20586] Some filesystem calls in dir.c are missing error handling and can return incorrect results if interrupted
2024-08-16 21:59 ` [ruby-core:118861] " jeremyevans0 (Jeremy Evans) via ruby-core
@ 2024-08-18 3:46 ` Alan D. Salewski via ruby-core
0 siblings, 0 replies; 13+ messages in thread
From: Alan D. Salewski via ruby-core @ 2024-08-18 3:46 UTC (permalink / raw)
To: ruby-core; +Cc: Alan D. Salewski
On 2024-08-16 21:59:04, "jeremyevans0 (Jeremy Evans) via ruby-core" <ruby-core@ml.ruby-lang.org> spake thus:
>Issue #20586 has been updated by jeremyevans0 (Jeremy Evans).
>
>
>I've submitted a pull request that I hope will address most of these issues: https://github.com/ruby/ruby/pull/11393
[...]
>
>ivoanjo (Ivo Anjo) wrote:
>> I've spent quite some time looking at the `dir.c` sources, and here's the full list of APIs that suffer from issues:
[...]
>> Less sure about these:
[...]
>> * Error handling of `rb_home_dir_of` and `rb_default_home_dir` are a bit suspicious
>
>`rb_home_dir_of` I have switching to use `rb_getpwdirnam_for_login` in pull request 11202, and that function has proper error checks. `rb_default_home_dir` calls `rb_getpwdirnam_for_login`, `rb_getlogin`, and `rb_getpwdiruid`. Can you let me know what part of `rb_default_home_dir` still looks suspicious?
[...]
`rb_getpwdirnam_for_login` currently does not have any explicit
handling for `EINTR`. However, it /is/ being treated as an
error[0][1], so should never result in a false negative ("no record
found for the supplied login_name").
I think it could be made to retry on `EINTR`, so signals such as
`SIGPROF` triggering `EINTR` would not result in unnecessary
errors. An untested patch for this is included below.
Thanks,
-Al
[0] the same as if `EIO`, `ENOMEM`, `EMFILE`, or `ENFILE`
[1] Calls to either `getpwnam_r` or `getpwnam` that result in errno
set to `EINTR` result in `rb_syserr_fail(...)`.
-->8--
diff --git a/process.c b/process.c
index b1f9931f06..a71951d496 100644
--- a/process.c
+++ b/process.c
@@ -5827,8 +5827,13 @@ rb_getpwdirnam_for_login(VALUE login_name)
rb_str_set_len(getpwnm_tmp, bufsizenm);
int enm;
+ again1:
errno = 0;
while ((enm = getpwnam_r(login, &pwdnm, bufnm, bufsizenm, &pwptr)) != 0) {
+ if (enm == EINTR) {
+ /* call was interrupted, e.g., by SIGPROF or similar */
+ goto again1;
+ }
if (enm == ENOENT || enm== ESRCH || enm == EBADF || enm == EPERM) {
/* not found; non-errors */
@@ -5859,16 +5864,23 @@ rb_getpwdirnam_for_login(VALUE login_name)
# elif USE_GETPWNAM
+ again2:
errno = 0;
pwptr = getpwnam(login);
if (pwptr) {
/* found it */
return rb_str_new_cstr(pwptr->pw_dir);
}
- if (errno
+ if (errno) {
+ if (errno == EINTR) {
+ /* call was interrupted, e.g., by SIGPROF or similar */
+ goto again2;
+ }
+
/* avoid treating as errors errno values that indicate "not found" */
- && ( errno != ENOENT && errno != ESRCH && errno != EBADF && errno != EPERM)) {
- rb_syserr_fail(errno, "getpwnam");
+ if ( errno != ENOENT && errno != ESRCH && errno != EBADF && errno != EPERM) {
+ rb_syserr_fail(errno, "getpwnam");
+ }
}
return Qnil; /* not found */
--8<--
--
a l a n d. s a l e w s k i
ads@salewski.email
salewski@att.net
https://github.com/salewski
______________________________________________
ruby-core mailing list -- ruby-core@ml.ruby-lang.org
To unsubscribe send an email to ruby-core-leave@ml.ruby-lang.org
ruby-core info -- https://ml.ruby-lang.org/mailman3/lists/ruby-core.ml.ruby-lang.org/
^ permalink raw reply [flat|nested] 13+ messages in thread
* [ruby-core:118890] [Ruby master Bug#20586] Some filesystem calls in dir.c are missing error handling and can return incorrect results if interrupted
2024-06-19 10:10 [ruby-core:118346] [Ruby master Bug#20586] Some filesystem calls in dir.c are missing error handling and can return incorrect results if interrupted ivoanjo (Ivo Anjo) via ruby-core
` (6 preceding siblings ...)
2024-08-16 21:59 ` [ruby-core:118861] " jeremyevans0 (Jeremy Evans) via ruby-core
@ 2024-08-19 9:19 ` ivoanjo (Ivo Anjo) via ruby-core
2024-08-19 22:32 ` [ruby-core:118898] " jeremyevans0 (Jeremy Evans) via ruby-core
` (2 subsequent siblings)
10 siblings, 0 replies; 13+ messages in thread
From: ivoanjo (Ivo Anjo) via ruby-core @ 2024-08-19 9:19 UTC (permalink / raw)
To: ruby-core; +Cc: ivoanjo (Ivo Anjo)
Issue #20586 has been updated by ivoanjo (Ivo Anjo).
Hey! As usual thanks for picking up these thorny bugs :)
The fix looks good! In particular I've re-tested it with a modified version of the Datadog profiler where I force it to interrupt threads without GVL (the bug is very hard to trigger since https://bugs.ruby-lang.org/issues/20587 was fixed) and I can confirm I see the error showing up correctly now:
```
$ DD_PROFILING_ENABLED=true DD_PROFILING_DIR_INTERRUPTION_WORKAROUND_ENABLED=false bundle exec ddprofrb exec ruby -e "Datadog::Profiling.wait_until_running; pp Dir.children('fuse-testing/')"
-e:1:in 'Dir.children': Interrupted system call - readdir (Errno::EINTR)
from -e:1:in '<main>'
```
> rb_home_dir_of I have switching to use rb_getpwdirnam_for_login in pull request 11202, and that function has proper error checks. rb_default_home_dir calls rb_getpwdirnam_for_login, rb_getlogin, and rb_getpwdiruid. Can you let me know what part of rb_default_home_dir still looks suspicious?
I've re-checked these, and between 11202 + me looking closer at some details and realizing they're correct I think it's mostly correct.
The only (potential?) issue I still see is with calls to `getlogin` -- that one seems to bottom out on a number of IO reads and whatnot so it does look like it may end up with an EINTR. Even with 11202, `rb_home_dir_of` is still calling `getlogin` directly and without error checking.
I also noticed that `ext/etc/etc.c` and `ext/pty/pty.c` seem to also use `getlogin` directly, so perhaps they'll also suffer from the same issue? 🤔
----------------------------------------
Bug #20586: Some filesystem calls in dir.c are missing error handling and can return incorrect results if interrupted
https://bugs.ruby-lang.org/issues/20586#change-109461
* Author: ivoanjo (Ivo Anjo)
* Status: Open
* Backport: 3.1: UNKNOWN, 3.2: UNKNOWN, 3.3: UNKNOWN
----------------------------------------
#### Background
Hey! I work for Datadog on the Ruby profiler part of the [`datadog` (previously `ddtrace`)](https://github.com/datadog/dd-trace-rb) gem.
A customer reached [out with an issue](https://github.com/DataDog/dd-trace-rb/issues/3450) where enabling the profiler made `Dir.glob` return no files for a given folder:
Without profiler:
```
irb(main):002:0> Dir.glob('/gcsfuse/t*')
=> ["/gcsfuse/test.html", "/gcsfuse/test.txt"]
```
With profiler:
```
irb(main):002:0> Dir.glob('/gcsfuse/t*')
=> []
```
It turns out the issue is related to missing error handling in `dir.c`.
The Datadog Ruby profiler, like stackprof, pf2 and vernier, uses unix signals to interrupt the currently-active thread and take a sample (usually `SIGPROF`). When some system calls get interrupted by a signal, they return an [EINTR error code](https://man7.org/linux/man-pages/man7/signal.7.html#:~:text=Interruption%20of%20system%20calls%20and%20library%20functions%20by%20signal%20handlers) back to the caller.
Consider for instance the implementation of `dir_each_entry` in `dir.c`:
```c
static VALUE
dir_each_entry(VALUE dir, VALUE (*each)(VALUE, VALUE), VALUE arg, int children_only)
{
struct dir_data *dirp;
struct dirent *dp;
IF_NORMALIZE_UTF8PATH(int norm_p);
GetDIR(dir, dirp);
rewinddir(dirp->dir);
IF_NORMALIZE_UTF8PATH(norm_p = need_normalization(dirp->dir, RSTRING_PTR(dirp->path)));
while ((dp = READDIR(dirp->dir, dirp->enc)) != NULL) {
// ... do things
}
return dir;
}
```
If `READDIR` returns `NULL`, then `dir_each_entry` assumes it has iterated the entire directory. But looking [at the man page for `readdir`](https://man7.org/linux/man-pages/man3/readdir.3.html) we see the following sharp edge (emphasis mine):
> It returns NULL on reaching the end of the directory stream **or if an error occurred**.
So what's happening in this situation is: `readdir` gets interrupted, returns `NULL` + sets errno to `EINTR`. But `dir_each_entry` doesn't check `errno`, so rather than raising an exception to flag the issue, it treats it as if the end of the directory has been reached.
#### How to reproduce
Reproducing this is somewhat annoying, because it's dependent on timing: the signal must arrive at the exact time the dir API is getting executed.
I was able to reproduce this every time by using the [google cloud `gcsfuse`](https://cloud.google.com/storage/docs/gcs-fuse) tool. This somewhat makes sense -- a remote filesystem is much slower than a local one, so there's a much bigger window of opportunity for a signal to arrive while the system call is blocked.
Here's an example I included in https://github.com/DataDog/dd-trace-rb/pull/3720:
```
# Not shown: Set up a trial google cloud account, install gcsfuse, create a cloud storage bucket and put in some test files
$ gcsfuse test_fs_dd_trace_rb fuse-testing/
$ ls fuse-testing/
hello.txt test.html test.txt
# Not shown: Add `datadog` gem to `Gemfile`
$ DD_PROFILING_ENABLED=true DD_PROFILING_DIR_INTERRUPTION_WORKAROUND_ENABLED=false bundle exec ddprofrb exec ruby -e "Datadog::Profiling.wait_until_running; pp Dir.children('fuse-testing/')"
[]
```
Let me know if you'd like me to try to create a reproducer that does not depend on the `datadog` gem.
#### Additional notes
I've spent quite some time looking at the `dir.c` sources, and here's the full list of APIs that suffer from issues:
* `dir_each_entry` does not check `errno`; all of its users have interruption bugs
* `dir_tell` will return -1 instead of the correct position (which means that passing -1 to `dir_seek`/`dir_set_pos` will cause it to not list the directory properly)
* `do_opendir` an error in system calls will only sometimes be turned into a raised exception
* Indirect callers that pass in rb_glob_error as errfunc: rb_glob, Dir.[], Dir.glob
* Indirect callers that pass in 0 as errfunc: ruby_glob, ruby_brace_glob
* `glob_opendir` does not check errno; all of its users have interruption bugs
* `glob_getent` does not check errno; all of its users have interruption bugs
* `nogvl_dir_empty_p` does not check errno (of readdir! it actually checks for opendir); all of its users have interruption bugs
Less sure about these:
* `do_stat`/`do_lstat` will turn errors into warnings (unclear if enabled or disabled by default)
* `need_normalization` calls `fgetattrlist` / `getattrlist` and all errors `(ret != 0)` are treated in the same way
* `rb_glob_error` is and `rb_glob_caller` leave exceptions as pending and rely on callers to raise them properly
* Error handling of `rb_home_dir_of` and `rb_default_home_dir` are a bit suspicious
As a workaround in the Datadog Ruby profiler, in https://github.com/DataDog/dd-trace-rb/pull/3720 I've added monkey patches to all of the Ruby-level APIs that use the above functions and mask out `SIGPROF` so these calls are never interrupted.
This solution is does successfully work around the issue, although it prevents the profiler from sampling during these system calls, which will mean less visibility if e.g. these calls are taking a long time. And well, maintaining monkey patches is always problematic for future Ruby compatibility.
---Files--------------------------------
readdir-bug-repro.c (2.11 KB)
--
https://bugs.ruby-lang.org/
______________________________________________
ruby-core mailing list -- ruby-core@ml.ruby-lang.org
To unsubscribe send an email to ruby-core-leave@ml.ruby-lang.org
ruby-core info -- https://ml.ruby-lang.org/mailman3/lists/ruby-core.ml.ruby-lang.org/
^ permalink raw reply [flat|nested] 13+ messages in thread
* [ruby-core:118898] [Ruby master Bug#20586] Some filesystem calls in dir.c are missing error handling and can return incorrect results if interrupted
2024-06-19 10:10 [ruby-core:118346] [Ruby master Bug#20586] Some filesystem calls in dir.c are missing error handling and can return incorrect results if interrupted ivoanjo (Ivo Anjo) via ruby-core
` (7 preceding siblings ...)
2024-08-19 9:19 ` [ruby-core:118890] " ivoanjo (Ivo Anjo) via ruby-core
@ 2024-08-19 22:32 ` jeremyevans0 (Jeremy Evans) via ruby-core
2024-08-21 22:52 ` [ruby-core:118911] " jeremyevans0 (Jeremy Evans) via ruby-core
2024-08-22 8:03 ` [ruby-core:118919] " ivoanjo (Ivo Anjo) via ruby-core
10 siblings, 0 replies; 13+ messages in thread
From: jeremyevans0 (Jeremy Evans) via ruby-core @ 2024-08-19 22:32 UTC (permalink / raw)
To: ruby-core; +Cc: jeremyevans0 (Jeremy Evans)
Issue #20586 has been updated by jeremyevans0 (Jeremy Evans).
Anonymous wrote in #note-8:
> `rb_getpwdirnam_for_login` currently does not have any explicit
> handling for `EINTR`. However, it /is/ being treated as an
> error[0][1], so should never result in a false negative ("no record
> found for the supplied login_name").
>
> I think it could be made to retry on `EINTR`, so signals such as
> `SIGPROF` triggering `EINTR` would not result in unnecessary
> errors. An untested patch for this is included below.
I think this is expected. The consensus seems to be that we should raise an error. Maybe we could change the handling to retry certain calls in the future, though. Maybe `getpwnam` is safe, but I'm not positive it is in all cases. It doesn't seem to have the same issues as `readdir`, but if the call does take a substantial amount of time, I could see it continually failing if SIGPROF interval is shorter than system call time.
ivoanjo (Ivo Anjo) wrote in #note-9:
> The only (potential?) issue I still see is with calls to `getlogin` -- that one seems to bottom out on a number of IO reads and whatnot so it does look like it may end up with an EINTR. Even with 11202, `rb_home_dir_of` is still calling `getlogin` directly and without error checking.
>
> I also noticed that `ext/etc/etc.c` and `ext/pty/pty.c` seem to also use `getlogin` directly, so perhaps they'll also suffer from the same issue? 🤔
That's a good point. I decided not to release GVL for `getlogin`, as I don't think it should block as it should not need to access the file system, just memory. However, I agree that we should still add error checking. Maybe I can do that in a separate pull request.
----------------------------------------
Bug #20586: Some filesystem calls in dir.c are missing error handling and can return incorrect results if interrupted
https://bugs.ruby-lang.org/issues/20586#change-109467
* Author: ivoanjo (Ivo Anjo)
* Status: Open
* Backport: 3.1: UNKNOWN, 3.2: UNKNOWN, 3.3: UNKNOWN
----------------------------------------
#### Background
Hey! I work for Datadog on the Ruby profiler part of the [`datadog` (previously `ddtrace`)](https://github.com/datadog/dd-trace-rb) gem.
A customer reached [out with an issue](https://github.com/DataDog/dd-trace-rb/issues/3450) where enabling the profiler made `Dir.glob` return no files for a given folder:
Without profiler:
```
irb(main):002:0> Dir.glob('/gcsfuse/t*')
=> ["/gcsfuse/test.html", "/gcsfuse/test.txt"]
```
With profiler:
```
irb(main):002:0> Dir.glob('/gcsfuse/t*')
=> []
```
It turns out the issue is related to missing error handling in `dir.c`.
The Datadog Ruby profiler, like stackprof, pf2 and vernier, uses unix signals to interrupt the currently-active thread and take a sample (usually `SIGPROF`). When some system calls get interrupted by a signal, they return an [EINTR error code](https://man7.org/linux/man-pages/man7/signal.7.html#:~:text=Interruption%20of%20system%20calls%20and%20library%20functions%20by%20signal%20handlers) back to the caller.
Consider for instance the implementation of `dir_each_entry` in `dir.c`:
```c
static VALUE
dir_each_entry(VALUE dir, VALUE (*each)(VALUE, VALUE), VALUE arg, int children_only)
{
struct dir_data *dirp;
struct dirent *dp;
IF_NORMALIZE_UTF8PATH(int norm_p);
GetDIR(dir, dirp);
rewinddir(dirp->dir);
IF_NORMALIZE_UTF8PATH(norm_p = need_normalization(dirp->dir, RSTRING_PTR(dirp->path)));
while ((dp = READDIR(dirp->dir, dirp->enc)) != NULL) {
// ... do things
}
return dir;
}
```
If `READDIR` returns `NULL`, then `dir_each_entry` assumes it has iterated the entire directory. But looking [at the man page for `readdir`](https://man7.org/linux/man-pages/man3/readdir.3.html) we see the following sharp edge (emphasis mine):
> It returns NULL on reaching the end of the directory stream **or if an error occurred**.
So what's happening in this situation is: `readdir` gets interrupted, returns `NULL` + sets errno to `EINTR`. But `dir_each_entry` doesn't check `errno`, so rather than raising an exception to flag the issue, it treats it as if the end of the directory has been reached.
#### How to reproduce
Reproducing this is somewhat annoying, because it's dependent on timing: the signal must arrive at the exact time the dir API is getting executed.
I was able to reproduce this every time by using the [google cloud `gcsfuse`](https://cloud.google.com/storage/docs/gcs-fuse) tool. This somewhat makes sense -- a remote filesystem is much slower than a local one, so there's a much bigger window of opportunity for a signal to arrive while the system call is blocked.
Here's an example I included in https://github.com/DataDog/dd-trace-rb/pull/3720:
```
# Not shown: Set up a trial google cloud account, install gcsfuse, create a cloud storage bucket and put in some test files
$ gcsfuse test_fs_dd_trace_rb fuse-testing/
$ ls fuse-testing/
hello.txt test.html test.txt
# Not shown: Add `datadog` gem to `Gemfile`
$ DD_PROFILING_ENABLED=true DD_PROFILING_DIR_INTERRUPTION_WORKAROUND_ENABLED=false bundle exec ddprofrb exec ruby -e "Datadog::Profiling.wait_until_running; pp Dir.children('fuse-testing/')"
[]
```
Let me know if you'd like me to try to create a reproducer that does not depend on the `datadog` gem.
#### Additional notes
I've spent quite some time looking at the `dir.c` sources, and here's the full list of APIs that suffer from issues:
* `dir_each_entry` does not check `errno`; all of its users have interruption bugs
* `dir_tell` will return -1 instead of the correct position (which means that passing -1 to `dir_seek`/`dir_set_pos` will cause it to not list the directory properly)
* `do_opendir` an error in system calls will only sometimes be turned into a raised exception
* Indirect callers that pass in rb_glob_error as errfunc: rb_glob, Dir.[], Dir.glob
* Indirect callers that pass in 0 as errfunc: ruby_glob, ruby_brace_glob
* `glob_opendir` does not check errno; all of its users have interruption bugs
* `glob_getent` does not check errno; all of its users have interruption bugs
* `nogvl_dir_empty_p` does not check errno (of readdir! it actually checks for opendir); all of its users have interruption bugs
Less sure about these:
* `do_stat`/`do_lstat` will turn errors into warnings (unclear if enabled or disabled by default)
* `need_normalization` calls `fgetattrlist` / `getattrlist` and all errors `(ret != 0)` are treated in the same way
* `rb_glob_error` is and `rb_glob_caller` leave exceptions as pending and rely on callers to raise them properly
* Error handling of `rb_home_dir_of` and `rb_default_home_dir` are a bit suspicious
As a workaround in the Datadog Ruby profiler, in https://github.com/DataDog/dd-trace-rb/pull/3720 I've added monkey patches to all of the Ruby-level APIs that use the above functions and mask out `SIGPROF` so these calls are never interrupted.
This solution is does successfully work around the issue, although it prevents the profiler from sampling during these system calls, which will mean less visibility if e.g. these calls are taking a long time. And well, maintaining monkey patches is always problematic for future Ruby compatibility.
---Files--------------------------------
readdir-bug-repro.c (2.11 KB)
--
https://bugs.ruby-lang.org/
______________________________________________
ruby-core mailing list -- ruby-core@ml.ruby-lang.org
To unsubscribe send an email to ruby-core-leave@ml.ruby-lang.org
ruby-core info -- https://ml.ruby-lang.org/mailman3/lists/ruby-core.ml.ruby-lang.org/
^ permalink raw reply [flat|nested] 13+ messages in thread
* [ruby-core:118911] [Ruby master Bug#20586] Some filesystem calls in dir.c are missing error handling and can return incorrect results if interrupted
2024-06-19 10:10 [ruby-core:118346] [Ruby master Bug#20586] Some filesystem calls in dir.c are missing error handling and can return incorrect results if interrupted ivoanjo (Ivo Anjo) via ruby-core
` (8 preceding siblings ...)
2024-08-19 22:32 ` [ruby-core:118898] " jeremyevans0 (Jeremy Evans) via ruby-core
@ 2024-08-21 22:52 ` jeremyevans0 (Jeremy Evans) via ruby-core
2024-08-22 8:03 ` [ruby-core:118919] " ivoanjo (Ivo Anjo) via ruby-core
10 siblings, 0 replies; 13+ messages in thread
From: jeremyevans0 (Jeremy Evans) via ruby-core @ 2024-08-21 22:52 UTC (permalink / raw)
To: ruby-core; +Cc: jeremyevans0 (Jeremy Evans)
Issue #20586 has been updated by jeremyevans0 (Jeremy Evans).
jeremyevans0 (Jeremy Evans) wrote in #note-10:
> ivoanjo (Ivo Anjo) wrote in #note-9:
> > The only (potential?) issue I still see is with calls to `getlogin` -- that one seems to bottom out on a number of IO reads and whatnot so it does look like it may end up with an EINTR. Even with 11202, `rb_home_dir_of` is still calling `getlogin` directly and without error checking.
> >
> > I also noticed that `ext/etc/etc.c` and `ext/pty/pty.c` seem to also use `getlogin` directly, so perhaps they'll also suffer from the same issue? 🤔
>
> That's a good point. I decided not to release GVL for `getlogin`, as I don't think it should block as it should not need to access the file system, just memory. However, I agree that we should still add error checking. Maybe I can do that in a separate pull request.
I've submitted a pull request to add error checking to `getlogin` to `rb_home_dir_of` and `PTY.spawn`: https://github.com/ruby/ruby/pull/11427
I reviewed `getlogin` usage in `etc` and it already handles errors, falling back to `USER` environment variable:
```c
login = getlogin();
if (!login) login = getenv("USER");
```
----------------------------------------
Bug #20586: Some filesystem calls in dir.c are missing error handling and can return incorrect results if interrupted
https://bugs.ruby-lang.org/issues/20586#change-109481
* Author: ivoanjo (Ivo Anjo)
* Status: Open
* Backport: 3.1: UNKNOWN, 3.2: UNKNOWN, 3.3: UNKNOWN
----------------------------------------
#### Background
Hey! I work for Datadog on the Ruby profiler part of the [`datadog` (previously `ddtrace`)](https://github.com/datadog/dd-trace-rb) gem.
A customer reached [out with an issue](https://github.com/DataDog/dd-trace-rb/issues/3450) where enabling the profiler made `Dir.glob` return no files for a given folder:
Without profiler:
```
irb(main):002:0> Dir.glob('/gcsfuse/t*')
=> ["/gcsfuse/test.html", "/gcsfuse/test.txt"]
```
With profiler:
```
irb(main):002:0> Dir.glob('/gcsfuse/t*')
=> []
```
It turns out the issue is related to missing error handling in `dir.c`.
The Datadog Ruby profiler, like stackprof, pf2 and vernier, uses unix signals to interrupt the currently-active thread and take a sample (usually `SIGPROF`). When some system calls get interrupted by a signal, they return an [EINTR error code](https://man7.org/linux/man-pages/man7/signal.7.html#:~:text=Interruption%20of%20system%20calls%20and%20library%20functions%20by%20signal%20handlers) back to the caller.
Consider for instance the implementation of `dir_each_entry` in `dir.c`:
```c
static VALUE
dir_each_entry(VALUE dir, VALUE (*each)(VALUE, VALUE), VALUE arg, int children_only)
{
struct dir_data *dirp;
struct dirent *dp;
IF_NORMALIZE_UTF8PATH(int norm_p);
GetDIR(dir, dirp);
rewinddir(dirp->dir);
IF_NORMALIZE_UTF8PATH(norm_p = need_normalization(dirp->dir, RSTRING_PTR(dirp->path)));
while ((dp = READDIR(dirp->dir, dirp->enc)) != NULL) {
// ... do things
}
return dir;
}
```
If `READDIR` returns `NULL`, then `dir_each_entry` assumes it has iterated the entire directory. But looking [at the man page for `readdir`](https://man7.org/linux/man-pages/man3/readdir.3.html) we see the following sharp edge (emphasis mine):
> It returns NULL on reaching the end of the directory stream **or if an error occurred**.
So what's happening in this situation is: `readdir` gets interrupted, returns `NULL` + sets errno to `EINTR`. But `dir_each_entry` doesn't check `errno`, so rather than raising an exception to flag the issue, it treats it as if the end of the directory has been reached.
#### How to reproduce
Reproducing this is somewhat annoying, because it's dependent on timing: the signal must arrive at the exact time the dir API is getting executed.
I was able to reproduce this every time by using the [google cloud `gcsfuse`](https://cloud.google.com/storage/docs/gcs-fuse) tool. This somewhat makes sense -- a remote filesystem is much slower than a local one, so there's a much bigger window of opportunity for a signal to arrive while the system call is blocked.
Here's an example I included in https://github.com/DataDog/dd-trace-rb/pull/3720:
```
# Not shown: Set up a trial google cloud account, install gcsfuse, create a cloud storage bucket and put in some test files
$ gcsfuse test_fs_dd_trace_rb fuse-testing/
$ ls fuse-testing/
hello.txt test.html test.txt
# Not shown: Add `datadog` gem to `Gemfile`
$ DD_PROFILING_ENABLED=true DD_PROFILING_DIR_INTERRUPTION_WORKAROUND_ENABLED=false bundle exec ddprofrb exec ruby -e "Datadog::Profiling.wait_until_running; pp Dir.children('fuse-testing/')"
[]
```
Let me know if you'd like me to try to create a reproducer that does not depend on the `datadog` gem.
#### Additional notes
I've spent quite some time looking at the `dir.c` sources, and here's the full list of APIs that suffer from issues:
* `dir_each_entry` does not check `errno`; all of its users have interruption bugs
* `dir_tell` will return -1 instead of the correct position (which means that passing -1 to `dir_seek`/`dir_set_pos` will cause it to not list the directory properly)
* `do_opendir` an error in system calls will only sometimes be turned into a raised exception
* Indirect callers that pass in rb_glob_error as errfunc: rb_glob, Dir.[], Dir.glob
* Indirect callers that pass in 0 as errfunc: ruby_glob, ruby_brace_glob
* `glob_opendir` does not check errno; all of its users have interruption bugs
* `glob_getent` does not check errno; all of its users have interruption bugs
* `nogvl_dir_empty_p` does not check errno (of readdir! it actually checks for opendir); all of its users have interruption bugs
Less sure about these:
* `do_stat`/`do_lstat` will turn errors into warnings (unclear if enabled or disabled by default)
* `need_normalization` calls `fgetattrlist` / `getattrlist` and all errors `(ret != 0)` are treated in the same way
* `rb_glob_error` is and `rb_glob_caller` leave exceptions as pending and rely on callers to raise them properly
* Error handling of `rb_home_dir_of` and `rb_default_home_dir` are a bit suspicious
As a workaround in the Datadog Ruby profiler, in https://github.com/DataDog/dd-trace-rb/pull/3720 I've added monkey patches to all of the Ruby-level APIs that use the above functions and mask out `SIGPROF` so these calls are never interrupted.
This solution is does successfully work around the issue, although it prevents the profiler from sampling during these system calls, which will mean less visibility if e.g. these calls are taking a long time. And well, maintaining monkey patches is always problematic for future Ruby compatibility.
---Files--------------------------------
readdir-bug-repro.c (2.11 KB)
--
https://bugs.ruby-lang.org/
______________________________________________
ruby-core mailing list -- ruby-core@ml.ruby-lang.org
To unsubscribe send an email to ruby-core-leave@ml.ruby-lang.org
ruby-core info -- https://ml.ruby-lang.org/mailman3/lists/ruby-core.ml.ruby-lang.org/
^ permalink raw reply [flat|nested] 13+ messages in thread
* [ruby-core:118919] [Ruby master Bug#20586] Some filesystem calls in dir.c are missing error handling and can return incorrect results if interrupted
2024-06-19 10:10 [ruby-core:118346] [Ruby master Bug#20586] Some filesystem calls in dir.c are missing error handling and can return incorrect results if interrupted ivoanjo (Ivo Anjo) via ruby-core
` (9 preceding siblings ...)
2024-08-21 22:52 ` [ruby-core:118911] " jeremyevans0 (Jeremy Evans) via ruby-core
@ 2024-08-22 8:03 ` ivoanjo (Ivo Anjo) via ruby-core
10 siblings, 0 replies; 13+ messages in thread
From: ivoanjo (Ivo Anjo) via ruby-core @ 2024-08-22 8:03 UTC (permalink / raw)
To: ruby-core; +Cc: ivoanjo (Ivo Anjo)
Issue #20586 has been updated by ivoanjo (Ivo Anjo).
jeremyevans0 (Jeremy Evans) wrote in #note-11:
>
> I've submitted a pull request to add error checking to `getlogin` to `rb_home_dir_of` and `PTY.spawn`: https://github.com/ruby/ruby/pull/11427
>
> I reviewed `getlogin` usage in `etc` and it already handles errors, falling back to `USER` environment variable:
>
> ```c
> login = getlogin();
> if (!login) login = getenv("USER");
> ```
👍👍👍
The "what's the worst thing that can happen" spidy sense makes me think that automatically falling back on errors that could be transient could mean that such calls have the potential to behave differently once-in-a-blue-moon (e.g. when the fallback kicks in).
On the other hand raising on error can mean failure on non-transient errors (and which ones are transient or not??) as well as rare exceptions being raised once-in-a-blue-moon, which does not seem very nice either, so I think the current version is reasonable.
----------------------------------------
Bug #20586: Some filesystem calls in dir.c are missing error handling and can return incorrect results if interrupted
https://bugs.ruby-lang.org/issues/20586#change-109489
* Author: ivoanjo (Ivo Anjo)
* Status: Open
* Backport: 3.1: UNKNOWN, 3.2: UNKNOWN, 3.3: UNKNOWN
----------------------------------------
#### Background
Hey! I work for Datadog on the Ruby profiler part of the [`datadog` (previously `ddtrace`)](https://github.com/datadog/dd-trace-rb) gem.
A customer reached [out with an issue](https://github.com/DataDog/dd-trace-rb/issues/3450) where enabling the profiler made `Dir.glob` return no files for a given folder:
Without profiler:
```
irb(main):002:0> Dir.glob('/gcsfuse/t*')
=> ["/gcsfuse/test.html", "/gcsfuse/test.txt"]
```
With profiler:
```
irb(main):002:0> Dir.glob('/gcsfuse/t*')
=> []
```
It turns out the issue is related to missing error handling in `dir.c`.
The Datadog Ruby profiler, like stackprof, pf2 and vernier, uses unix signals to interrupt the currently-active thread and take a sample (usually `SIGPROF`). When some system calls get interrupted by a signal, they return an [EINTR error code](https://man7.org/linux/man-pages/man7/signal.7.html#:~:text=Interruption%20of%20system%20calls%20and%20library%20functions%20by%20signal%20handlers) back to the caller.
Consider for instance the implementation of `dir_each_entry` in `dir.c`:
```c
static VALUE
dir_each_entry(VALUE dir, VALUE (*each)(VALUE, VALUE), VALUE arg, int children_only)
{
struct dir_data *dirp;
struct dirent *dp;
IF_NORMALIZE_UTF8PATH(int norm_p);
GetDIR(dir, dirp);
rewinddir(dirp->dir);
IF_NORMALIZE_UTF8PATH(norm_p = need_normalization(dirp->dir, RSTRING_PTR(dirp->path)));
while ((dp = READDIR(dirp->dir, dirp->enc)) != NULL) {
// ... do things
}
return dir;
}
```
If `READDIR` returns `NULL`, then `dir_each_entry` assumes it has iterated the entire directory. But looking [at the man page for `readdir`](https://man7.org/linux/man-pages/man3/readdir.3.html) we see the following sharp edge (emphasis mine):
> It returns NULL on reaching the end of the directory stream **or if an error occurred**.
So what's happening in this situation is: `readdir` gets interrupted, returns `NULL` + sets errno to `EINTR`. But `dir_each_entry` doesn't check `errno`, so rather than raising an exception to flag the issue, it treats it as if the end of the directory has been reached.
#### How to reproduce
Reproducing this is somewhat annoying, because it's dependent on timing: the signal must arrive at the exact time the dir API is getting executed.
I was able to reproduce this every time by using the [google cloud `gcsfuse`](https://cloud.google.com/storage/docs/gcs-fuse) tool. This somewhat makes sense -- a remote filesystem is much slower than a local one, so there's a much bigger window of opportunity for a signal to arrive while the system call is blocked.
Here's an example I included in https://github.com/DataDog/dd-trace-rb/pull/3720:
```
# Not shown: Set up a trial google cloud account, install gcsfuse, create a cloud storage bucket and put in some test files
$ gcsfuse test_fs_dd_trace_rb fuse-testing/
$ ls fuse-testing/
hello.txt test.html test.txt
# Not shown: Add `datadog` gem to `Gemfile`
$ DD_PROFILING_ENABLED=true DD_PROFILING_DIR_INTERRUPTION_WORKAROUND_ENABLED=false bundle exec ddprofrb exec ruby -e "Datadog::Profiling.wait_until_running; pp Dir.children('fuse-testing/')"
[]
```
Let me know if you'd like me to try to create a reproducer that does not depend on the `datadog` gem.
#### Additional notes
I've spent quite some time looking at the `dir.c` sources, and here's the full list of APIs that suffer from issues:
* `dir_each_entry` does not check `errno`; all of its users have interruption bugs
* `dir_tell` will return -1 instead of the correct position (which means that passing -1 to `dir_seek`/`dir_set_pos` will cause it to not list the directory properly)
* `do_opendir` an error in system calls will only sometimes be turned into a raised exception
* Indirect callers that pass in rb_glob_error as errfunc: rb_glob, Dir.[], Dir.glob
* Indirect callers that pass in 0 as errfunc: ruby_glob, ruby_brace_glob
* `glob_opendir` does not check errno; all of its users have interruption bugs
* `glob_getent` does not check errno; all of its users have interruption bugs
* `nogvl_dir_empty_p` does not check errno (of readdir! it actually checks for opendir); all of its users have interruption bugs
Less sure about these:
* `do_stat`/`do_lstat` will turn errors into warnings (unclear if enabled or disabled by default)
* `need_normalization` calls `fgetattrlist` / `getattrlist` and all errors `(ret != 0)` are treated in the same way
* `rb_glob_error` is and `rb_glob_caller` leave exceptions as pending and rely on callers to raise them properly
* Error handling of `rb_home_dir_of` and `rb_default_home_dir` are a bit suspicious
As a workaround in the Datadog Ruby profiler, in https://github.com/DataDog/dd-trace-rb/pull/3720 I've added monkey patches to all of the Ruby-level APIs that use the above functions and mask out `SIGPROF` so these calls are never interrupted.
This solution is does successfully work around the issue, although it prevents the profiler from sampling during these system calls, which will mean less visibility if e.g. these calls are taking a long time. And well, maintaining monkey patches is always problematic for future Ruby compatibility.
---Files--------------------------------
readdir-bug-repro.c (2.11 KB)
--
https://bugs.ruby-lang.org/
______________________________________________
ruby-core mailing list -- ruby-core@ml.ruby-lang.org
To unsubscribe send an email to ruby-core-leave@ml.ruby-lang.org
ruby-core info -- https://ml.ruby-lang.org/mailman3/lists/ruby-core.ml.ruby-lang.org/
^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2024-08-22 8:04 UTC | newest]
Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-06-19 10:10 [ruby-core:118346] [Ruby master Bug#20586] Some filesystem calls in dir.c are missing error handling and can return incorrect results if interrupted ivoanjo (Ivo Anjo) via ruby-core
2024-06-19 10:17 ` [ruby-core:118348] " ivoanjo (Ivo Anjo) via ruby-core
2024-06-19 11:11 ` [ruby-core:118350] " mame (Yusuke Endoh) via ruby-core
2024-06-19 14:04 ` [ruby-core:118354] " ivoanjo (Ivo Anjo) via ruby-core
2024-06-19 19:57 ` [ruby-core:118358] " Eregon (Benoit Daloze) via ruby-core
2024-06-20 5:02 ` [ruby-core:118359] " mame (Yusuke Endoh) via ruby-core
2024-06-20 9:53 ` [ruby-core:118364] " ivoanjo (Ivo Anjo) via ruby-core
2024-08-16 21:59 ` [ruby-core:118861] " jeremyevans0 (Jeremy Evans) via ruby-core
2024-08-18 3:46 ` [ruby-core:118874] " Alan D. Salewski via ruby-core
2024-08-19 9:19 ` [ruby-core:118890] " ivoanjo (Ivo Anjo) via ruby-core
2024-08-19 22:32 ` [ruby-core:118898] " jeremyevans0 (Jeremy Evans) via ruby-core
2024-08-21 22:52 ` [ruby-core:118911] " jeremyevans0 (Jeremy Evans) via ruby-core
2024-08-22 8:03 ` [ruby-core:118919] " ivoanjo (Ivo Anjo) via ruby-core
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).