mailing list of musl libc
* [musl] Backwards kernel compatibility
@ 2021-05-10  5:50 Martin Vajnar
  2021-05-10  6:46 ` Florian Weimer
  2021-05-10 18:58 ` Markus Wichmann
  0 siblings, 2 replies; 13+ messages in thread
From: Martin Vajnar @ 2021-05-10  5:50 UTC (permalink / raw)
  To: musl


Hello guys,

I'd like to ask whether running a recent musl on older kernels is generally
supported. My primary concern is that new syscalls keep being added to
Linux, while I do not see a switch similar to glibc's --enable-kernel for
selecting a compatibility mode. Is there some mechanism that prevents
invocation of unimplemented syscalls on older kernels when using musl?

Best regards,
Martin Vajnar



* Re: [musl] Backwards kernel compatibility
  2021-05-10  5:50 [musl] Backwards kernel compatibility Martin Vajnar
@ 2021-05-10  6:46 ` Florian Weimer
  2021-05-10 18:58 ` Markus Wichmann
  1 sibling, 0 replies; 13+ messages in thread
From: Florian Weimer @ 2021-05-10  6:46 UTC (permalink / raw)
  To: Martin Vajnar; +Cc: musl

* Martin Vajnar:

> I'd like to ask, if it is generally supported to run recent musl on
> older kernels? My primary concern is that there are new syscalls being
> added to linux, while at the same time I do not see a switch similar
> to glibc's to select compatibility mode (--enable-kernel).

--enable-kernel is used to *remove* compatibility with older kernels,
not add it, so it does the opposite of what you want.

I believe musl has greater compatibility with older kernels than glibc.

Thanks,
Florian



* Re: [musl] Backwards kernel compatibility
  2021-05-10  5:50 [musl] Backwards kernel compatibility Martin Vajnar
  2021-05-10  6:46 ` Florian Weimer
@ 2021-05-10 18:58 ` Markus Wichmann
  2021-05-24 13:52   ` Martin Vajnar
  1 sibling, 1 reply; 13+ messages in thread
From: Markus Wichmann @ 2021-05-10 18:58 UTC (permalink / raw)
  To: musl

On Mon, May 10, 2021 at 07:50:44AM +0200, Martin Vajnar wrote:
> Hello guys,
>
> I'd like to ask, if it is generally supported to run recent musl on older
> kernels? My primary concern is that there are new syscalls being added to
> linux, while at the same time I do not see a switch similar to glibc's to
> select compatibility mode (--enable-kernel). Is there some means which
> prevent invocation of unimplemented syscalls on older kernels when using
> musl?
>
> Best regards,
> Martin Vajnar

In general, musl tries to support all kernel versions from 2.6.0 on. If
you call a newer system call on a kernel that doesn't support it, you
will get ENOSYS back, but all algorithms implemented in the library will
fall back to that smallest common denominator (and some things even
further).

There is no way to prevent calls to new system calls on older kernels,
since the kernel already takes care of that.
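
The ENOSYS fallback described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not musl's actual code: the "kernel" is simulated by plain functions, and every name here (`sys_fn`, `new_call_missing`, `new_call_present`, `old_call`, `call_with_fallback`) is hypothetical.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-ins for raw syscalls: each returns a value >= 0 on
 * success or a negated errno on failure. */
typedef long (*sys_fn)(long arg);

static long new_call_missing(long arg) { (void)arg; return -ENOSYS; } /* old kernel */
static long new_call_present(long arg) { return arg + 1000; }         /* new kernel */
static long old_call(long arg)         { return arg; }                /* always there */

/* Try the newer interface first; retry with the older one only if the
 * kernel reports that the newer one does not exist. */
static long call_with_fallback(sys_fn newer, sys_fn older, long arg)
{
	assert(newer && older);
	long r = newer(arg);
	if (r != -ENOSYS) return r;  /* success, or a real error to report */
	return older(arg);           /* the kernel lacks the new call */
}
```

With these stand-ins, `call_with_fallback(new_call_missing, old_call, 7)` takes the fallback path and returns 7, while `call_with_fallback(new_call_present, old_call, 7)` never reaches the old call and returns 1007.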

Ciao,
Markus


* Re: [musl] Backwards kernel compatibility
  2021-05-10 18:58 ` Markus Wichmann
@ 2021-05-24 13:52   ` Martin Vajnar
  2021-05-24 22:00     ` Rich Felker
  0 siblings, 1 reply; 13+ messages in thread
From: Martin Vajnar @ 2021-05-24 13:52 UTC (permalink / raw)
  To: musl; +Cc: Markus Wichmann, Florian Weimer


Hi, Markus,

sorry for the late reply; it has been quite busy lately. You're describing
exactly the issue we are facing in our project. We need to use an old kernel
that we have only in binary form, together with its headers. At the same time
we would like to run the latest musl on it.

The problem we encounter is that unsupported (or rather, not yet
supported) syscalls incur a performance overhead because of the ENOSYS.

We see 2 options to approach this:

 1. remove the syscalls manually/alter the code to not invoke them (hacky)
 2. at musl compile time (maybe even configure time), parse the
    supplied kernel headers and, based on which syscalls are available, use
    defines to steer code execution (more universal)

Would the 2nd option be something the musl community would be interested in,
should we choose to implement it for the project?


Regards,
Martin

On Mon, May 10, 2021 at 20:58, Markus Wichmann <nullplan@gmx.net> wrote:

> On Mon, May 10, 2021 at 07:50:44AM +0200, Martin Vajnar wrote:
> > Hello guys,
> >
> > I'd like to ask, if it is generally supported to run recent musl on older
> > kernels? My primary concern is that there are new syscalls being added to
> > linux, while at the same time I do not see a switch similar to glibc's to
> > select compatibility mode (--enable-kernel). Is there some means which
> > prevent invocation of unimplemented syscalls on older kernels when using
> > musl?
> >
> > Best regards,
> > Martin Vajnar
>
> In general, musl tries to support all kernel versions from 2.6.0 on. If
> you call a newer system call on a kernel that doesn't support it, you
> will get ENOSYS back, but all algorithms implemented in the library will
> fall back to that smallest common denominator (and some things even
> further).
>
> There is no way to prevent calls to new system calls on older kernels,
> since the kernel already takes care of that.
>
> Ciao,
> Markus
>



* Re: [musl] Backwards kernel compatibility
  2021-05-24 13:52   ` Martin Vajnar
@ 2021-05-24 22:00     ` Rich Felker
  2021-06-02  7:38       ` Martin Vajnar
  2021-06-08 22:16       ` Martin Vajnar
  0 siblings, 2 replies; 13+ messages in thread
From: Rich Felker @ 2021-05-24 22:00 UTC (permalink / raw)
  To: Martin Vajnar; +Cc: musl, Markus Wichmann, Florian Weimer

On Mon, May 24, 2021 at 03:52:44PM +0200, Martin Vajnar wrote:
> Hi, Markus,
> 
> sorry for the late reply it was quite busy lately. You're describing
> exactly the issue, we are facing in our project. We need to use old kernel
> which we have only in binary form and have headers for it. At the same time
> we would like to have the latest musl running on it.
> 
> The problem we encounter is that for unsupported (or better said, not
> supported yet) syscalls we get performance overhead because of the ENOSYS.

Can you give some information on what syscalls these are and if/how
you measured the performance overhead as being significant?

> We see 2 options to approach this:
> 
>  1. remove the syscalls manually/alter the code to not invoke them (hacky)
>  2. during musl compile time (maybe even configure-time), parse the
> supplied kernel headers and based on availability of syscalls use defines
> to steer the code execution (more universal)
> 
> Would the 2nd case be something that musl community would be interested in,
> should we choose to implement it for the project?

No, but hopefully there's a third option: identify whatever place the
fallback is actually a performance bottleneck and do what we can to
mitigate it. If it's really bad, saving the result might be an option,
but we've tried to avoid that both for complexity reasons and because
it could preclude fixing serious problems (like Y2038 EOL) by
live-migrating processes to a newer kernel with new syscalls that
avoid the bug. A better approach is just using the "oldest" syscall
that can actually do the job, which we already try to do in most
places in musl, only relying on the newer one for inputs that require
it. However this is not possible for functions that read back a time,
since the input is external (e.g. the system clock or the filesystem)
and it's not known in advance whether the old syscall could represent
the result.
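
A minimal sketch of the "oldest syscall that can do the job" idea for a timeout argument, loosely modeled on the pattern visible in the patch later in this thread. The kernel is simulated, and all names (`fits_time32`, `clamp_time32`, `sys_wait_time64`, `sys_wait_time32`, `wait_with_timeout`) are hypothetical rather than musl's internals:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Simulated kernel state; every name in this sketch is hypothetical. */
static int kernel_has_time64;    /* 0 models an old kernel */
static int time64_calls;         /* how often the new call was attempted */
static int32_t last_time32_arg;  /* what the old call last received */

static int fits_time32(int64_t s) { return s >= INT32_MIN && s <= INT32_MAX; }

static int32_t clamp_time32(int64_t s)
{
	if (s > INT32_MAX) return INT32_MAX;
	if (s < INT32_MIN) return INT32_MIN;
	return (int32_t)s;
}

static long sys_wait_time64(int64_t sec)
{
	(void)sec;
	time64_calls++;
	return kernel_has_time64 ? 0 : -ENOSYS;
}

static long sys_wait_time32(int32_t sec) { last_time32_arg = sec; return 0; }

/* Use the new call only for inputs the old one cannot represent. */
static long wait_with_timeout(int64_t sec)
{
	assert(sec >= 0);  /* this sketch assumes non-negative timeouts */
	long r = -ENOSYS;
	if (!fits_time32(sec)) r = sys_wait_time64(sec);
	if (r != -ENOSYS) return r;
	/* Either the value fits, or the kernel is too old: clamp and fall back. */
	return sys_wait_time32(clamp_time32(sec));
}
```

For a timeout that fits in 32 bits, the new call is never attempted, so an old kernel sees no ENOSYS at all. This only works because the input comes from the caller; it does not carry over to calls that read a time back.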

It *might* be plausible to memorize the result "new syscall not
available" but drop that memory whenever we see a result that
indicates a failure due to use of the outdated syscall. We're kinda
already doing that with the vdso clock_gettime -- cgt_time32_wrap
disables itself if it ever sees a negative value for seconds.
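
The idea of remembering "new syscall not available" while dropping that memory on a suspicious result might look roughly like this. It is a sketch against a simulated kernel, not the cgt_time32_wrap code; all names are hypothetical, and a real implementation would also have to consider concurrent access to the flag:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Simulated kernel; every name in this sketch is hypothetical. */
static int kernel_has_time64;
static int64_t kernel_now;
static int time64_attempts;

static long sys_gettime64(int64_t *out)
{
	time64_attempts++;
	if (!kernel_has_time64) return -ENOSYS;
	*out = kernel_now;
	return 0;
}

static long sys_gettime32(int32_t *out)
{
	/* Truncation to 32 bits wraps on common compilers, like a 32-bit time_t. */
	*out = (int32_t)(uint32_t)kernel_now;
	return 0;
}

static int time64_known_missing;  /* the remembered verdict */

static long get_time(int64_t *out)
{
	assert(out);
	if (!time64_known_missing) {
		long r = sys_gettime64(out);
		if (r != -ENOSYS) return r;
		time64_known_missing = 1;  /* skip the probe next time */
	}
	int32_t t32;
	long r = sys_gettime32(&t32);
	if (r == 0) {
		/* A negative time suggests the old call wrapped: forget the
		 * verdict so that the next call probes the new syscall again. */
		if (t32 < 0) time64_known_missing = 0;
		*out = t32;
	}
	return r;
}
```

On an old kernel, only the first call pays for the ENOSYS. If the process later finds itself on a kernel with the new syscall (the live-migration case mentioned above), the first wrapped result re-enables probing.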

An alternative approach, especially if this is a matter of time64, to
avoid nonstandard binaries that would be non-future-proof, might be to
patch your kernel with a loadable module that adds dumb translation
layers for the syscalls that are performance bottlenecks.

Rich


* Re: [musl] Backwards kernel compatibility
  2021-05-24 22:00     ` Rich Felker
@ 2021-06-02  7:38       ` Martin Vajnar
  2021-06-02 11:52         ` Arnd Bergmann
  2021-06-08 22:16       ` Martin Vajnar
  1 sibling, 1 reply; 13+ messages in thread
From: Martin Vajnar @ 2021-06-02  7:38 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl, Markus Wichmann, Florian Weimer

Hi Rich,

thank you for such a detailed reply.

On Tue, May 25, 2021 at 0:00, Rich Felker <dalias@libc.org> wrote:
>
> On Mon, May 24, 2021 at 03:52:44PM +0200, Martin Vajnar wrote:
> > Hi, Markus,
> >
> > sorry for the late reply it was quite busy lately. You're describing
> > exactly the issue, we are facing in our project. We need to use old kernel
> > which we have only in binary form and have headers for it. At the same time
> > we would like to have the latest musl running on it.
> >
> > The problem we encounter is that for unsupported (or better said, not
> > supported yet) syscalls we get performance overhead because of the ENOSYS.
>
> Can you give some information on what syscalls these are and if/how
> you measured the performance overhead as being significant?


The main source of overhead is kernel 4.4, which on arm64
produces stack traces when an unimplemented syscall is invoked:

    https://github.com/torvalds/linux/blob/afd2ff9b7e1b367172f18ba7f693dfb62bdcb2dc/arch/arm64/kernel/traps.c#L369

While the kernel is dumping, there is a noticeable slowdown in system response,
as the dump sometimes takes up to tens of milliseconds (in the example below,
dropbear is running in AArch32 mode):

    [90276.609777] dropbear[29686]: syscall 403
    [90276.611310] Code: 4620e02b f2404629 463a1393 df00461f (f1104617)
    [90276.615265] CPU: 2 PID: 29686 Comm: dropbear Tainted: P           4.4.60 #1
    [90276.621212] Hardware name: board.A
    [90276.628688] task: ffffffc029fc3700 ti: ffffffc029fc3700 task.ti: ffffffc029fc3700
    [90276.633277] PC is at 0x57168
    [90276.640748] LR is at 0x22223
    [90276.643683] pc : [<0000000000057168>] lr : [<0000000000022223>] pstate: 20000030
    [90276.646569] sp : 00000000ff94f118
    [90276.653933] x12: 0000000000000007
    [90276.660429] x11: 0000000000000000 x10: 0000000000000000
    [90276.665830] x9 : 0000000000000000 x8 : 0000000000000000
    [90276.678746] x7 : 0000000000000193 x6 : 0000000000010325
    [90276.684039] x5 : 00000000ff94f158 x4 : 0000000000000001
    [90276.689337] x3 : 0000000000000193 x2 : 00000000ff94f130
    [90276.694629] x1 : 00000000ff94f158 x0 : 0000000000000001

> > We see 2 options to approach this:
> >
> >  1. remove the syscalls manually/alter the code to not invoke them (hacky)
> >  2. during musl compile time (maybe even configure-time), parse the
> > supplied kernel headers and based on availability of syscalls use defines
> > to steer the code execution (more universal)
> >
> > Would the 2nd case be something that musl community would be interested in,
> > should we choose to implement it for the project?
>
> No, but hopefully there's a third option: identify whatever place the
> fallback is actual a performance bottleneck and do what we can to
> mitigate it. If it's really bad, saving the result might be an option,
> but we've tried to avoid that both for complexity reasons and because
> it could preclude fixing serious problems (like Y2038 EOL) by
> live-migrating processes to a newer kernel with new syscalls that
> avoid the bug. A better approach is just using the "oldest" syscall
> that can actually do the job, which we already try to do in most
> places in musl, only relying on the newer one for inputs that require
> it. However this is not possible for functions that read back a time,
> since the input is external (e.g. the system clock or the filesystem)
> and it's not known in advance whether the old syscall could represent
> the result.

Yes, I think this is the case as I only saw the stack traces for syscalls
403 (__NR_clock_gettime64) and 397 (__NR_statx).

> It *might* be plausible to memorize the result "new syscall not
> available" but drop that memory whenever we see a result that
> indicates a failure due to use of the outdated syscall. We're kinda
> already doing that with the vdso clock_gettime -- cgt_time32_wrap
> disables itself if it ever sees a negative value for seconds.

Thanks for the suggestion. I think this would solve the issue I'm
experiencing, and it would definitely be cleaner than my original idea.
It would significantly decrease the number of traces produced.

I can prepare patches implementing this for stat*() and clock_gettime()
and see how it looks compared to the approach below.

> An alternative approach, especially if this is a matter of time64, to
> avoid nonstandard binaries that would be non-future-proof, might be to
> patch your kernel with a loadable module that adds dumb translation
> layers for the syscalls that are performance bottlenecks.
>
> Rich

Regards,
Martin


* Re: [musl] Backwards kernel compatibility
  2021-06-02  7:38       ` Martin Vajnar
@ 2021-06-02 11:52         ` Arnd Bergmann
  2021-06-02 14:56           ` Rich Felker
  2021-06-09  7:03           ` Arnd Bergmann
  0 siblings, 2 replies; 13+ messages in thread
From: Arnd Bergmann @ 2021-06-02 11:52 UTC (permalink / raw)
  To: musl; +Cc: Rich Felker, Markus Wichmann, Florian Weimer

On Wed, Jun 2, 2021 at 9:38 AM Martin Vajnar <martin.vajnar@gmail.com> wrote:
>
> Hi Rich,
>
> thank you for such detailed reply.
>
> Cc: <stable@vger.kernel.org>
> Cc: Martin Vajnar <martin.vajnar@gmail.com>
> Cc: musl@lists.openwall.com
> Acked-by: Will Deacon <will.deacon@arm.com>
> Signed-off-by: Michael Weiser <michael.weiser@gmx.de>
> Signed-off-by: Will Deacon <will.deacon@arm.com>
> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> ---
> This was backported to v4.14 and later, but is missing in v4.4 and
> before, apparently because of a trivial merge conflict. This is
> a manual backport I did after I saw a report about the issue
> by Martin Vajnar on the musl mailing list.
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> ---
>  arch/arm64/kernel/traps.c | 8 --------
>  1 file changed, 8 deletions(-)
>
> diff --git a/arch/arm64/kernel/traps.c b/arch/arm64/kernel/traps.c
> index 9322be69ca09..db4163808c76 100644
> --- a/arch/arm64/kernel/traps.c
> +++ b/arch/arm64/kernel/traps.c
> @@ -363,14 +363,6 @@ asmlinkage long do_ni_syscall(struct pt_regs *regs)
>         }
>  #endif
>
> -       if (show_unhandled_signals && printk_ratelimit()) {
> -               pr_info("%s[%d]: syscall %d\n", current->comm,
> -                       task_pid_nr(current), (int)regs->syscallno);
> -               dump_instr("", regs);
> -               if (user_mode(regs))
> -                       __show_regs(regs);
> -       }
> -
>         return sys_ni_syscall();
>  }
>
> --
> 2.29.2
>
>
> On Tue, May 25, 2021 at 0:00, Rich Felker <dalias@libc.org> wrote:
> >
> > On Mon, May 24, 2021 at 03:52:44PM +0200, Martin Vajnar wrote:
> > > Hi, Markus,
> > >
> > > sorry for the late reply it was quite busy lately. You're describing
> > > exactly the issue, we are facing in our project. We need to use old kernel
> > > which we have only in binary form and have headers for it. At the same time
> > > we would like to have the latest musl running on it.
> > >
> > > The problem we encounter is that for unsupported (or better said, not
> > > supported yet) syscalls we get performance overhead because of the ENOSYS.
> >
> > Can you give some information on what syscalls these are and if/how
> > you measured the performance overhead as being significant?
>
>
> The main source of overhead comes from the kernel 4.4 which on arm64
> produces stack traces when not implemented syscall is invoked:
>
>     https://github.com/torvalds/linux/blob/afd2ff9b7e1b367172f18ba7f693dfb62bdcb2dc/arch/arm64/kernel/traps.c#L369

That is clearly a bug that was fixed in mainline and backported to linux-4.14
but not 4.4 or 4.9. I've sent a manual backport for inclusion in those kernels
now.

       Arnd


* Re: [musl] Backwards kernel compatibility
  2021-06-02 11:52         ` Arnd Bergmann
@ 2021-06-02 14:56           ` Rich Felker
  2021-06-02 16:01             ` Arnd Bergmann
  2021-06-09  7:03           ` Arnd Bergmann
  1 sibling, 1 reply; 13+ messages in thread
From: Rich Felker @ 2021-06-02 14:56 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: musl, Markus Wichmann, Florian Weimer

On Wed, Jun 02, 2021 at 01:52:43PM +0200, Arnd Bergmann wrote:
> On Wed, Jun 2, 2021 at 9:38 AM Martin Vajnar <martin.vajnar@gmail.com> wrote:
> >
> > Hi Rich,
> >
> > thank you for such detailed reply.
> >
> > Cc: <stable@vger.kernel.org>
> > Cc: Martin Vajnar <martin.vajnar@gmail.com>
> > Cc: musl@lists.openwall.com
> > Acked-by: Will Deacon <will.deacon@arm.com>
> > Signed-off-by: Michael Weiser <michael.weiser@gmx.de>
> > Signed-off-by: Will Deacon <will.deacon@arm.com>
> > Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> > Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> > ---
> > This was backported to v4.14 and later, but is missing in v4.4 and
> > before, apparently because of a trivial merge conflict. This is
> > a manual backport I did after I saw a report about the issue
> > by Martin Vajnar on the musl mailing list.
> > Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> > ---
> >  arch/arm64/kernel/traps.c | 8 --------
> >  1 file changed, 8 deletions(-)
> >
> > diff --git a/arch/arm64/kernel/traps.c b/arch/arm64/kernel/traps.c
> > index 9322be69ca09..db4163808c76 100644
> > --- a/arch/arm64/kernel/traps.c
> > +++ b/arch/arm64/kernel/traps.c
> > @@ -363,14 +363,6 @@ asmlinkage long do_ni_syscall(struct pt_regs *regs)
> >         }
> >  #endif
> >
> > -       if (show_unhandled_signals && printk_ratelimit()) {
> > -               pr_info("%s[%d]: syscall %d\n", current->comm,
> > -                       task_pid_nr(current), (int)regs->syscallno);
> > -               dump_instr("", regs);
> > -               if (user_mode(regs))
> > -                       __show_regs(regs);
> > -       }
> > -
> >         return sys_ni_syscall();
> >  }
> >
> > --
> > 2.29.2
> >
> >
> > On Tue, May 25, 2021 at 0:00, Rich Felker <dalias@libc.org> wrote:
> > >
> > > On Mon, May 24, 2021 at 03:52:44PM +0200, Martin Vajnar wrote:
> > > > Hi, Markus,
> > > >
> > > > sorry for the late reply it was quite busy lately. You're describing
> > > > exactly the issue, we are facing in our project. We need to use old kernel
> > > > which we have only in binary form and have headers for it. At the same time
> > > > we would like to have the latest musl running on it.
> > > >
> > > > The problem we encounter is that for unsupported (or better said, not
> > > > supported yet) syscalls we get performance overhead because of the ENOSYS.
> > >
> > > Can you give some information on what syscalls these are and if/how
> > > you measured the performance overhead as being significant?
> >
> >
> > The main source of overhead comes from the kernel 4.4 which on arm64
> > produces stack traces when not implemented syscall is invoked:
> >
> >     https://github.com/torvalds/linux/blob/afd2ff9b7e1b367172f18ba7f693dfb62bdcb2dc/arch/arm64/kernel/traps.c#L369
> 
> That is clearly a bug that was fixed in mainline and backported to linux-4.14
> but not 4.4 or 4.9. I've sent a manual backport for inclusion in those kernels
> now.

Is this practical to hotpatch into kernels on devices that aren't
readily upgradable?

Rich


* Re: [musl] Backwards kernel compatibility
  2021-06-02 14:56           ` Rich Felker
@ 2021-06-02 16:01             ` Arnd Bergmann
  2021-06-02 16:18               ` Arnd Bergmann
  0 siblings, 1 reply; 13+ messages in thread
From: Arnd Bergmann @ 2021-06-02 16:01 UTC (permalink / raw)
  To: musl; +Cc: Markus Wichmann, Florian Weimer

On Wed, Jun 2, 2021 at 4:56 PM Rich Felker <dalias@libc.org> wrote:
> On Wed, Jun 02, 2021 at 01:52:43PM +0200, Arnd Bergmann wrote:
> > >
> > > The main source of overhead comes from the kernel 4.4 which on arm64
> > > produces stack traces when not implemented syscall is invoked:
> > >
> > >     https://github.com/torvalds/linux/blob/afd2ff9b7e1b367172f18ba7f693dfb62bdcb2dc/arch/arm64/kernel/traps.c#L369
> >
> > That is clearly a bug that was fixed in mainline and backported to linux-4.14
> > but not 4.4 or 4.9. I've sent a manual backport for inclusion in those kernels
> > now.
>
> Is this practical to hotpatch into kernels on devices that aren't
> readily upgradable?

Including the patch in a source tree is trivial, as it just removes a few lines
of (misguided) output. If you are asking about run-time patching it out of
a running kernel using kpatch/kGraft/ksplice, this would also be doable
by patching out the branch in that function, but the infrastructure for live
patching kernels is likely missing on most of the systems that lack a way to
replace the kernel image, so in practice it would not help.

       Arnd


* Re: [musl] Backwards kernel compatibility
  2021-06-02 16:01             ` Arnd Bergmann
@ 2021-06-02 16:18               ` Arnd Bergmann
  0 siblings, 0 replies; 13+ messages in thread
From: Arnd Bergmann @ 2021-06-02 16:18 UTC (permalink / raw)
  To: musl; +Cc: Markus Wichmann, Florian Weimer

On Wed, Jun 2, 2021 at 6:01 PM Arnd Bergmann <arnd@kernel.org> wrote:
>
> On Wed, Jun 2, 2021 at 4:56 PM Rich Felker <dalias@libc.org> wrote:
> > On Wed, Jun 02, 2021 at 01:52:43PM +0200, Arnd Bergmann wrote:
> > > >
> > > > The main source of overhead comes from the kernel 4.4 which on arm64
> > > > produces stack traces when not implemented syscall is invoked:
> > > >
> > > >     https://github.com/torvalds/linux/blob/afd2ff9b7e1b367172f18ba7f693dfb62bdcb2dc/arch/arm64/kernel/traps.c#L369
> > >
> > > That is clearly a bug that was fixed in mainline and backported to linux-4.14
> > > but not 4.4 or 4.9. I've sent a manual backport for inclusion in those kernels
> > > now.
> >
> > Is this practical to hotpatch into kernels on devices that aren't
> > readily upgradable?
>
> Including the patch in a source tree is trivial, as it just removes a few lines
> of (misguided) output. If you are asking about run-time patching it out of
> a running kernel using kpatch/kGraft/ksplice, this would also be doable
> by patching out the branch in that function, but the infrastructure for live
> patching kernels is likely missing on most of the systems that lack a way to
> replace the kernel image, so in practice it would not help.

I found one more thing: the warning is controlled by
/proc/sys/debug/exception-trace. Writing a zero into that file
disables it, along with the output for unhandled signals that kill
a process.
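
For systems where this has to happen from a program rather than a shell, writing the zero could be sketched as below. The helper name `disable_exception_trace` is hypothetical; the path is passed in so the function can be tried on an ordinary file, and on a real system it would be /proc/sys/debug/exception-trace and would typically need root:

```c
#include <assert.h>
#include <stdio.h>

/* Write "0\n" to the sysctl file at `path`.
 * Returns 0 on success and -1 if the file cannot be opened or written. */
static int disable_exception_trace(const char *path)
{
	assert(path);
	FILE *f = fopen(path, "w");
	if (!f) return -1;
	int ok = fputs("0\n", f) >= 0;
	if (fclose(f) != 0) ok = 0;  /* a buffered write error may surface here */
	return ok ? 0 : -1;
}
```

The setting does not persist across reboots, so it would have to be applied again at every boot, for example from an init script.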

Starting with linux-4.4.19/linux-4.9.84/linux-4.14.23/linux-4.16, the
output is already disabled by default; in earlier arm64 kernels it
is enabled. The patch to disable this was merged upstream at the
same time as the one that removes the unhandled-syscall
warning, but the older kernels (4.4 and 4.9) were missing the
backport of that second patch.

On other architectures, this sysctl never controlled printing
the unhandled syscalls, only unhandled signals, but it remains
enabled by default.

       Arnd


* Re: [musl] Backwards kernel compatibility
  2021-05-24 22:00     ` Rich Felker
  2021-06-02  7:38       ` Martin Vajnar
@ 2021-06-08 22:16       ` Martin Vajnar
  2021-06-09  0:37         ` Rich Felker
  1 sibling, 1 reply; 13+ messages in thread
From: Martin Vajnar @ 2021-06-08 22:16 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl, Markus Wichmann, Florian Weimer


Hi, Rich,

On Tue, May 25, 2021 at 0:00, Rich Felker <dalias@libc.org> wrote:
>
> On Mon, May 24, 2021 at 03:52:44PM +0200, Martin Vajnar wrote:
> > Hi, Markus,
> >
> > sorry for the late reply it was quite busy lately. You're describing
> > exactly the issue, we are facing in our project. We need to use old kernel
> > which we have only in binary form and have headers for it. At the same time
> > we would like to have the latest musl running on it.
> >
> > The problem we encounter is that for unsupported (or better said, not
> > supported yet) syscalls we get performance overhead because of the ENOSYS.
>
> Can you give some information on what syscalls these are and if/how
> you measured the performance overhead as being significant?
>
> > We see 2 options to approach this:
> >
> >  1. remove the syscalls manually/alter the code to not invoke them (hacky)
> >  2. during musl compile time (maybe even configure-time), parse the
> > supplied kernel headers and based on availability of syscalls use defines
> > to steer the code execution (more universal)
> >
> > Would the 2nd case be something that musl community would be interested in,
> > should we choose to implement it for the project?
>
> No, but hopefully there's a third option: identify whatever place the
> fallback is actual a performance bottleneck and do what we can to
> mitigate it. If it's really bad, saving the result might be an option,
> but we've tried to avoid that both for complexity reasons and because
> it could preclude fixing serious problems (like Y2038 EOL) by
> live-migrating processes to a newer kernel with new syscalls that
> avoid the bug. A better approach is just using the "oldest" syscall
> that can actually do the job, which we already try to do in most
> places in musl, only relying on the newer one for inputs that require
> it. However this is not possible for functions that read back a time,
> since the input is external (e.g. the system clock or the filesystem)
> and it's not known in advance whether the old syscall could represent
> the result.
>
> It *might* be plausible to memorize the result "new syscall not
> available" but drop that memory whenever we see a result that
> indicates a failure due to use of the outdated syscall. We're kinda
> already doing that with the vdso clock_gettime -- cgt_time32_wrap
> disables itself if it ever sees a negative value for seconds.

Since updating the kernel is not an option for me, I prepared a patch
that remembers a failed attempt at the time64 syscall variants and at
the statx syscall, so that subsequent calls skip them and use the
fallback directly. I'm attaching the patch in case someone else is facing
the same issue. Please let me know whether this approach would be of
interest for upstreaming and, if so, whether there are any changes I
should make.

> An alternative approach, especially if this is a matter of time64, to
> avoid nonstandard binaries that would be non-future-proof, might be to
> patch your kernel with a loadable module that adds dumb translation
> layers for the syscalls that are performance bottlenecks.
>
> Rich

Regards,
Martin

[-- Attachment #2: 0001-store-fallback-paths-for-kernels-without-time64-stat.patch --]
[-- Type: text/x-patch, Size: 23861 bytes --]

From ceb2371ada673d258328a019691bff014ea3abca Mon Sep 17 00:00:00 2001
From: Martin Vajnar <martin.vajnar@gmail.com>
Date: Tue, 8 Jun 2021 23:36:11 +0200
Subject: [PATCH] store fallback paths for kernels without time64/statx
 syscalls

---
 src/ipc/semtimedop.c                 |  12 ++--
 src/linux/clock_adjtime.c            | 100 ++++++++++++++-------------
 src/linux/ppoll.c                    |  16 +++--
 src/linux/timerfd.c                  |  28 +++++---
 src/misc/getrusage.c                 |  28 ++++----
 src/mq/mq_timedreceive.c             |  14 ++--
 src/mq/mq_timedsend.c                |  14 ++--
 src/network/recvmmsg.c               |  13 ++--
 src/select/pselect.c                 |  14 ++--
 src/select/select.c                  |  16 +++--
 src/signal/sigtimedwait.c            |  14 ++--
 src/stat/fstatat.c                   |  10 ++-
 src/stat/utimensat.c                 |  14 ++--
 src/thread/__timedwait.c             |  12 ++--
 src/thread/pthread_mutex_timedlock.c |  12 ++--
 src/time/clock_gettime.c             |  12 ++--
 src/time/clock_nanosleep.c           |  14 ++--
 src/time/clock_settime.c             |  14 ++--
 18 files changed, 217 insertions(+), 140 deletions(-)

diff --git a/src/ipc/semtimedop.c b/src/ipc/semtimedop.c
index 1632e7b0..d9fe404d 100644
--- a/src/ipc/semtimedop.c
+++ b/src/ipc/semtimedop.c
@@ -16,13 +16,17 @@
 int semtimedop(int id, struct sembuf *buf, size_t n, const struct timespec *ts)
 {
 #ifdef SYS_semtimedop_time64
+	static int use_semtimedop_time64 = 1;
 	time_t s = ts ? ts->tv_sec : 0;
 	long ns = ts ? ts->tv_nsec : 0;
 	int r = -ENOSYS;
-	if (NO_TIME32 || !IS32BIT(s))
-		r = __syscall(SYS_semtimedop_time64, id, buf, n,
-			ts ? ((long long[]){s, ns}) : 0);
-	if (NO_TIME32 || r!=-ENOSYS) return __syscall_ret(r);
+	if (use_semtimedop_time64) {
+		if (NO_TIME32 || !IS32BIT(s))
+			r = __syscall(SYS_semtimedop_time64, id, buf, n,
+				ts ? ((long long[]){s, ns}) : 0);
+		if (NO_TIME32 || r!=-ENOSYS) return __syscall_ret(r);
+		else use_semtimedop_time64 = 0;
+	}
 	ts = ts ? (void *)(long[]){CLAMP(s), ns} : 0;
 #endif
 #if defined(SYS_ipc)
diff --git a/src/linux/clock_adjtime.c b/src/linux/clock_adjtime.c
index d4d03d24..a4d43406 100644
--- a/src/linux/clock_adjtime.c
+++ b/src/linux/clock_adjtime.c
@@ -38,55 +38,59 @@ int clock_adjtime (clockid_t clock_id, struct timex *utx)
 {
 	int r = -ENOSYS;
 #ifdef SYS_clock_adjtime64
-	struct ktimex64 ktx = {
-		.modes = utx->modes,
-		.offset = utx->offset,
-		.freq = utx->freq,
-		.maxerror = utx->maxerror,
-		.esterror = utx->esterror,
-		.status = utx->status,
-		.constant = utx->constant,
-		.precision = utx->precision,
-		.tolerance = utx->tolerance,
-		.time_sec = utx->time.tv_sec,
-		.time_usec = utx->time.tv_usec,
-		.tick = utx->tick,
-		.ppsfreq = utx->ppsfreq,
-		.jitter = utx->jitter,
-		.shift = utx->shift,
-		.stabil = utx->stabil,
-		.jitcnt = utx->jitcnt,
-		.calcnt = utx->calcnt,
-		.errcnt = utx->errcnt,
-		.stbcnt = utx->stbcnt,
-		.tai = utx->tai,
-	};
-	r = __syscall(SYS_clock_adjtime64, clock_id, &ktx);
-	if (r>=0) {
-		utx->modes = ktx.modes;
-		utx->offset = ktx.offset;
-		utx->freq = ktx.freq;
-		utx->maxerror = ktx.maxerror;
-		utx->esterror = ktx.esterror;
-		utx->status = ktx.status;
-		utx->constant = ktx.constant;
-		utx->precision = ktx.precision;
-		utx->tolerance = ktx.tolerance;
-		utx->time.tv_sec = ktx.time_sec;
-		utx->time.tv_usec = ktx.time_usec;
-		utx->tick = ktx.tick;
-		utx->ppsfreq = ktx.ppsfreq;
-		utx->jitter = ktx.jitter;
-		utx->shift = ktx.shift;
-		utx->stabil = ktx.stabil;
-		utx->jitcnt = ktx.jitcnt;
-		utx->calcnt = ktx.calcnt;
-		utx->errcnt = ktx.errcnt;
-		utx->stbcnt = ktx.stbcnt;
-		utx->tai = ktx.tai;
+	static int use_clock_adjtime64 = 1;
+	if (use_clock_adjtime64) {
+		struct ktimex64 ktx = {
+			.modes = utx->modes,
+			.offset = utx->offset,
+			.freq = utx->freq,
+			.maxerror = utx->maxerror,
+			.esterror = utx->esterror,
+			.status = utx->status,
+			.constant = utx->constant,
+			.precision = utx->precision,
+			.tolerance = utx->tolerance,
+			.time_sec = utx->time.tv_sec,
+			.time_usec = utx->time.tv_usec,
+			.tick = utx->tick,
+			.ppsfreq = utx->ppsfreq,
+			.jitter = utx->jitter,
+			.shift = utx->shift,
+			.stabil = utx->stabil,
+			.jitcnt = utx->jitcnt,
+			.calcnt = utx->calcnt,
+			.errcnt = utx->errcnt,
+			.stbcnt = utx->stbcnt,
+			.tai = utx->tai,
+		};
+		r = __syscall(SYS_clock_adjtime64, clock_id, &ktx);
+		if (r>=0) {
+			utx->modes = ktx.modes;
+			utx->offset = ktx.offset;
+			utx->freq = ktx.freq;
+			utx->maxerror = ktx.maxerror;
+			utx->esterror = ktx.esterror;
+			utx->status = ktx.status;
+			utx->constant = ktx.constant;
+			utx->precision = ktx.precision;
+			utx->tolerance = ktx.tolerance;
+			utx->time.tv_sec = ktx.time_sec;
+			utx->time.tv_usec = ktx.time_usec;
+			utx->tick = ktx.tick;
+			utx->ppsfreq = ktx.ppsfreq;
+			utx->jitter = ktx.jitter;
+			utx->shift = ktx.shift;
+			utx->stabil = ktx.stabil;
+			utx->jitcnt = ktx.jitcnt;
+			utx->calcnt = ktx.calcnt;
+			utx->errcnt = ktx.errcnt;
+			utx->stbcnt = ktx.stbcnt;
+			utx->tai = ktx.tai;
+		}
+		if (SYS_clock_adjtime == SYS_clock_adjtime64 || r!=-ENOSYS)
+			return __syscall_ret(r);
+		else use_clock_adjtime64 = 0;
 	}
-	if (SYS_clock_adjtime == SYS_clock_adjtime64 || r!=-ENOSYS)
-		return __syscall_ret(r);
 	if ((utx->modes & ADJ_SETOFFSET) && !IS32BIT(utx->time.tv_sec))
 		return __syscall_ret(-ENOTSUP);
 #endif
diff --git a/src/linux/ppoll.c b/src/linux/ppoll.c
index e614600a..939f3f1d 100644
--- a/src/linux/ppoll.c
+++ b/src/linux/ppoll.c
@@ -12,13 +12,17 @@ int ppoll(struct pollfd *fds, nfds_t n, const struct timespec *to, const sigset_
 	time_t s = to ? to->tv_sec : 0;
 	long ns = to ? to->tv_nsec : 0;
 #ifdef SYS_ppoll_time64
+	static int use_ppoll_time64 = 1;
 	int r = -ENOSYS;
-	if (SYS_ppoll == SYS_ppoll_time64 || !IS32BIT(s))
-		r = __syscall_cp(SYS_ppoll_time64, fds, n,
-			to ? ((long long[]){s, ns}) : 0,
-			mask, _NSIG/8);
-	if (SYS_ppoll == SYS_ppoll_time64 || r != -ENOSYS)
-		return __syscall_ret(r);
+	if (use_ppoll_time64) {
+		if (SYS_ppoll == SYS_ppoll_time64 || !IS32BIT(s))
+			r = __syscall_cp(SYS_ppoll_time64, fds, n,
+				to ? ((long long[]){s, ns}) : 0,
+				mask, _NSIG/8);
+		if (SYS_ppoll == SYS_ppoll_time64 || r != -ENOSYS)
+			return __syscall_ret(r);
+		else use_ppoll_time64 = 0;
+	}
 	s = CLAMP(s);
 #endif
 	return syscall_cp(SYS_ppoll, fds, n,
diff --git a/src/linux/timerfd.c b/src/linux/timerfd.c
index 5bdfaf16..65bda29b 100644
--- a/src/linux/timerfd.c
+++ b/src/linux/timerfd.c
@@ -14,13 +14,17 @@ int timerfd_settime(int fd, int flags, const struct itimerspec *new, struct itim
 #ifdef SYS_timerfd_settime64
 	time_t is = new->it_interval.tv_sec, vs = new->it_value.tv_sec;
 	long ins = new->it_interval.tv_nsec, vns = new->it_value.tv_nsec;
+	static int use_timerfd_settime64 = 1;
 	int r = -ENOSYS;
-	if (SYS_timerfd_settime == SYS_timerfd_settime64
-	    || !IS32BIT(is) || !IS32BIT(vs) || (sizeof(time_t)>4 && old))
-		r = __syscall(SYS_timerfd_settime64, fd, flags,
-			((long long[]){is, ins, vs, vns}), old);
-	if (SYS_timerfd_settime == SYS_timerfd_settime64 || r!=-ENOSYS)
-		return __syscall_ret(r);
+	if (use_timerfd_settime64) {
+		if (SYS_timerfd_settime == SYS_timerfd_settime64
+			|| !IS32BIT(is) || !IS32BIT(vs) || (sizeof(time_t)>4 && old))
+			r = __syscall(SYS_timerfd_settime64, fd, flags,
+				((long long[]){is, ins, vs, vns}), old);
+		if (SYS_timerfd_settime == SYS_timerfd_settime64 || r!=-ENOSYS)
+			return __syscall_ret(r);
+		else use_timerfd_settime64 = 0;
+	}
 	if (!IS32BIT(is) || !IS32BIT(vs))
 		return __syscall_ret(-ENOTSUP);
 	long old32[4];
@@ -40,11 +44,15 @@ int timerfd_settime(int fd, int flags, const struct itimerspec *new, struct itim
 int timerfd_gettime(int fd, struct itimerspec *cur)
 {
 #ifdef SYS_timerfd_gettime64
+	static int use_timerfd_gettime64 = 1;
 	int r = -ENOSYS;
-	if (sizeof(time_t) > 4)
-		r = __syscall(SYS_timerfd_gettime64, fd, cur);
-	if (SYS_timerfd_gettime == SYS_timerfd_gettime64 || r!=-ENOSYS)
-		return __syscall_ret(r);
+	if (use_timerfd_gettime64) {
+		if (sizeof(time_t) > 4)
+			r = __syscall(SYS_timerfd_gettime64, fd, cur);
+		if (SYS_timerfd_gettime == SYS_timerfd_gettime64 || r!=-ENOSYS)
+			return __syscall_ret(r);
+		else use_timerfd_gettime64 = 0;
+	}
 	long cur32[4];
 	r = __syscall(SYS_timerfd_gettime, fd, cur32);
 	if (!r) {
diff --git a/src/misc/getrusage.c b/src/misc/getrusage.c
index 8e03e2e3..47d87981 100644
--- a/src/misc/getrusage.c
+++ b/src/misc/getrusage.c
@@ -7,19 +7,23 @@ int getrusage(int who, struct rusage *ru)
 {
 	int r;
 #ifdef SYS_getrusage_time64
-	long long kru64[18];
-	r = __syscall(SYS_getrusage_time64, who, kru64);
-	if (!r) {
-		ru->ru_utime = (struct timeval)
-			{ .tv_sec = kru64[0], .tv_usec = kru64[1] };
-		ru->ru_stime = (struct timeval)
-			{ .tv_sec = kru64[2], .tv_usec = kru64[3] };
-		char *slots = (char *)&ru->ru_maxrss;
-		for (int i=0; i<14; i++)
-			*(long *)(slots + i*sizeof(long)) = kru64[4+i];
+	static int use_getrusage_time64 = 1;
+	if (use_getrusage_time64) {
+		long long kru64[18];
+		r = __syscall(SYS_getrusage_time64, who, kru64);
+		if (!r) {
+			ru->ru_utime = (struct timeval)
+				{ .tv_sec = kru64[0], .tv_usec = kru64[1] };
+			ru->ru_stime = (struct timeval)
+				{ .tv_sec = kru64[2], .tv_usec = kru64[3] };
+			char *slots = (char *)&ru->ru_maxrss;
+			for (int i=0; i<14; i++)
+				*(long *)(slots + i*sizeof(long)) = kru64[4+i];
+		}
+		if (SYS_getrusage_time64 == SYS_getrusage || r != -ENOSYS)
+			return __syscall_ret(r);
+		else use_getrusage_time64 = 0;
 	}
-	if (SYS_getrusage_time64 == SYS_getrusage || r != -ENOSYS)
-		return __syscall_ret(r);
 #endif
 	char *dest = (char *)&ru->ru_maxrss - 4*sizeof(long);
 	r = __syscall(SYS_getrusage, who, dest);
diff --git a/src/mq/mq_timedreceive.c b/src/mq/mq_timedreceive.c
index f41b6642..43c4fef6 100644
--- a/src/mq/mq_timedreceive.c
+++ b/src/mq/mq_timedreceive.c
@@ -8,14 +8,18 @@
 ssize_t mq_timedreceive(mqd_t mqd, char *restrict msg, size_t len, unsigned *restrict prio, const struct timespec *restrict at)
 {
 #ifdef SYS_mq_timedreceive_time64
+	static int use_mq_timedreceive_time64 = 1;
 	time_t s = at ? at->tv_sec : 0;
 	long ns = at ? at->tv_nsec : 0;
 	long r = -ENOSYS;
-	if (SYS_mq_timedreceive == SYS_mq_timedreceive_time64 || !IS32BIT(s))
-		r = __syscall_cp(SYS_mq_timedreceive_time64, mqd, msg, len, prio,
-			at ? ((long long []){at->tv_sec, at->tv_nsec}) : 0);
-	if (SYS_mq_timedreceive == SYS_mq_timedreceive_time64 || r != -ENOSYS)
-		return __syscall_ret(r);
+	if (use_mq_timedreceive_time64) {
+		if (SYS_mq_timedreceive == SYS_mq_timedreceive_time64 || !IS32BIT(s))
+			r = __syscall_cp(SYS_mq_timedreceive_time64, mqd, msg, len, prio,
+				at ? ((long long []){at->tv_sec, at->tv_nsec}) : 0);
+		if (SYS_mq_timedreceive == SYS_mq_timedreceive_time64 || r != -ENOSYS)
+			return __syscall_ret(r);
+		else use_mq_timedreceive_time64 = 0;
+	}
 	return syscall_cp(SYS_mq_timedreceive, mqd, msg, len, prio,
 		at ? ((long[]){CLAMP(s), ns}) : 0);
 #else
diff --git a/src/mq/mq_timedsend.c b/src/mq/mq_timedsend.c
index 56cfcbb8..77164093 100644
--- a/src/mq/mq_timedsend.c
+++ b/src/mq/mq_timedsend.c
@@ -8,14 +8,18 @@
 int mq_timedsend(mqd_t mqd, const char *msg, size_t len, unsigned prio, const struct timespec *at)
 {
 #ifdef SYS_mq_timedsend_time64
+	static int use_mq_timedsend_time64 = 1;
 	time_t s = at ? at->tv_sec : 0;
 	long ns = at ? at->tv_nsec : 0;
 	long r = -ENOSYS;
-	if (SYS_mq_timedsend == SYS_mq_timedsend_time64 || !IS32BIT(s))
-		r = __syscall_cp(SYS_mq_timedsend_time64, mqd, msg, len, prio,
-			at ? ((long long []){at->tv_sec, at->tv_nsec}) : 0);
-	if (SYS_mq_timedsend == SYS_mq_timedsend_time64 || r != -ENOSYS)
-		return __syscall_ret(r);
+	if (use_mq_timedsend_time64) {
+		if (SYS_mq_timedsend == SYS_mq_timedsend_time64 || !IS32BIT(s))
+			r = __syscall_cp(SYS_mq_timedsend_time64, mqd, msg, len, prio,
+				at ? ((long long []){at->tv_sec, at->tv_nsec}) : 0);
+		if (SYS_mq_timedsend == SYS_mq_timedsend_time64 || r != -ENOSYS)
+			return __syscall_ret(r);
+		else use_mq_timedsend_time64 = 0;
+	}
 	return syscall_cp(SYS_mq_timedsend, mqd, msg, len, prio,
 		at ? ((long[]){CLAMP(s), ns}) : 0);
 #else
diff --git a/src/network/recvmmsg.c b/src/network/recvmmsg.c
index 2978e2f6..ea603854 100644
--- a/src/network/recvmmsg.c
+++ b/src/network/recvmmsg.c
@@ -19,12 +19,17 @@ int recvmmsg(int fd, struct mmsghdr *msgvec, unsigned int vlen, unsigned int fla
 		mh->msg_hdr.__pad1 = mh->msg_hdr.__pad2 = 0;
 #endif
 #ifdef SYS_recvmmsg_time64
+	static int use_recvmmsg_time64 = 1;
 	time_t s = timeout ? timeout->tv_sec : 0;
 	long ns = timeout ? timeout->tv_nsec : 0;
-	int r = __syscall_cp(SYS_recvmmsg_time64, fd, msgvec, vlen, flags,
-			timeout ? ((long long[]){s, ns}) : 0);
-	if (SYS_recvmmsg == SYS_recvmmsg_time64 || r!=-ENOSYS)
-		return __syscall_ret(r);
+	int r;
+	if (use_recvmmsg_time64) {
+		r = __syscall_cp(SYS_recvmmsg_time64, fd, msgvec, vlen, flags,
+				timeout ? ((long long[]){s, ns}) : 0);
+		if (SYS_recvmmsg == SYS_recvmmsg_time64 || r!=-ENOSYS)
+			return __syscall_ret(r);
+		else use_recvmmsg_time64 = 0;
+	}
 	if (vlen > IOV_MAX) vlen = IOV_MAX;
 	socklen_t csize[vlen];
 	for (int i=0; i<vlen; i++) csize[i] = msgvec[i].msg_hdr.msg_controllen;
diff --git a/src/select/pselect.c b/src/select/pselect.c
index 54cfb291..ef221f14 100644
--- a/src/select/pselect.c
+++ b/src/select/pselect.c
@@ -13,12 +13,16 @@ int pselect(int n, fd_set *restrict rfds, fd_set *restrict wfds, fd_set *restric
 	time_t s = ts ? ts->tv_sec : 0;
 	long ns = ts ? ts->tv_nsec : 0;
 #ifdef SYS_pselect6_time64
+	static int use_pselect6_time64 = 1;
 	int r = -ENOSYS;
-	if (SYS_pselect6 == SYS_pselect6_time64 || !IS32BIT(s))
-		r = __syscall_cp(SYS_pselect6_time64, n, rfds, wfds, efds,
-			ts ? ((long long[]){s, ns}) : 0, data);
-	if (SYS_pselect6 == SYS_pselect6_time64 || r!=-ENOSYS)
-		return __syscall_ret(r);
+	if (use_pselect6_time64) {
+		if (SYS_pselect6 == SYS_pselect6_time64 || !IS32BIT(s))
+			r = __syscall_cp(SYS_pselect6_time64, n, rfds, wfds, efds,
+				ts ? ((long long[]){s, ns}) : 0, data);
+		if (SYS_pselect6 == SYS_pselect6_time64 || r!=-ENOSYS)
+			return __syscall_ret(r);
+		else use_pselect6_time64 = 0;
+	}
 	s = CLAMP(s);
 #endif
 	return syscall_cp(SYS_pselect6, n, rfds, wfds, efds,
diff --git a/src/select/select.c b/src/select/select.c
index 8a786884..810cf450 100644
--- a/src/select/select.c
+++ b/src/select/select.c
@@ -26,13 +26,17 @@ int select(int n, fd_set *restrict rfds, fd_set *restrict wfds, fd_set *restrict
 	}
 
 #ifdef SYS_pselect6_time64
+	static int use_pselect6_time64 = 1;
 	int r = -ENOSYS;
-	if (SYS_pselect6 == SYS_pselect6_time64 || !IS32BIT(s))
-		r = __syscall_cp(SYS_pselect6_time64, n, rfds, wfds, efds,
-			tv ? ((long long[]){s, ns}) : 0,
-			((syscall_arg_t[]){ 0, _NSIG/8 }));
-	if (SYS_pselect6 == SYS_pselect6_time64 || r!=-ENOSYS)
-		return __syscall_ret(r);
+	if (use_pselect6_time64) {
+		if (SYS_pselect6 == SYS_pselect6_time64 || !IS32BIT(s))
+			r = __syscall_cp(SYS_pselect6_time64, n, rfds, wfds, efds,
+				tv ? ((long long[]){s, ns}) : 0,
+				((syscall_arg_t[]){ 0, _NSIG/8 }));
+		if (SYS_pselect6 == SYS_pselect6_time64 || r!=-ENOSYS)
+			return __syscall_ret(r);
+		else use_pselect6_time64 = 0;
+	}
 #endif
 #ifdef SYS_select
 	return syscall_cp(SYS_select, n, rfds, wfds, efds,
diff --git a/src/signal/sigtimedwait.c b/src/signal/sigtimedwait.c
index 1287174e..efc786db 100644
--- a/src/signal/sigtimedwait.c
+++ b/src/signal/sigtimedwait.c
@@ -8,14 +8,18 @@
 static int do_sigtimedwait(const sigset_t *restrict mask, siginfo_t *restrict si, const struct timespec *restrict ts)
 {
 #ifdef SYS_rt_sigtimedwait_time64
+	static int use_rt_sigtimedwait_time64 = 1;
 	time_t s = ts ? ts->tv_sec : 0;
 	long ns = ts ? ts->tv_nsec : 0;
 	int r = -ENOSYS;
-	if (SYS_rt_sigtimedwait == SYS_rt_sigtimedwait_time64 || !IS32BIT(s))
-		r = __syscall_cp(SYS_rt_sigtimedwait_time64, mask, si,
-			ts ? ((long long[]){s, ns}) : 0, _NSIG/8);
-	if (SYS_rt_sigtimedwait == SYS_rt_sigtimedwait_time64 || r!=-ENOSYS)
-		return r;
+	if (use_rt_sigtimedwait_time64) {
+		if (SYS_rt_sigtimedwait == SYS_rt_sigtimedwait_time64 || !IS32BIT(s))
+			r = __syscall_cp(SYS_rt_sigtimedwait_time64, mask, si,
+				ts ? ((long long[]){s, ns}) : 0, _NSIG/8);
+		if (SYS_rt_sigtimedwait == SYS_rt_sigtimedwait_time64 || r!=-ENOSYS)
+			return r;
+		else use_rt_sigtimedwait_time64 = 0;
+	}
 	return __syscall_cp(SYS_rt_sigtimedwait, mask, si,
 		ts ? ((long[]){CLAMP(s), ns}) : 0, _NSIG/8);;
 #else
diff --git a/src/stat/fstatat.c b/src/stat/fstatat.c
index de165b5c..e8e8c866 100644
--- a/src/stat/fstatat.c
+++ b/src/stat/fstatat.c
@@ -133,10 +133,14 @@ static int fstatat_kstat(int fd, const char *restrict path, struct stat *restric
 
 int fstatat(int fd, const char *restrict path, struct stat *restrict st, int flag)
 {
+	static int use_statx = 1;
 	int ret;
-	if (sizeof((struct kstat){0}.st_atime_sec) < sizeof(time_t)) {
-		ret = fstatat_statx(fd, path, st, flag);
-		if (ret!=-ENOSYS) return __syscall_ret(ret);
+	if (use_statx) {
+		if (sizeof((struct kstat){0}.st_atime_sec) < sizeof(time_t)) {
+			ret = fstatat_statx(fd, path, st, flag);
+			if (ret!=-ENOSYS) return __syscall_ret(ret);
+			else use_statx = 0;
+		}
 	}
 	ret = fstatat_kstat(fd, path, st, flag);
 	return __syscall_ret(ret);
diff --git a/src/stat/utimensat.c b/src/stat/utimensat.c
index 730723a9..adde7e4e 100644
--- a/src/stat/utimensat.c
+++ b/src/stat/utimensat.c
@@ -13,6 +13,7 @@ int utimensat(int fd, const char *path, const struct timespec times[2], int flag
 	if (times && times[0].tv_nsec==UTIME_NOW && times[1].tv_nsec==UTIME_NOW)
 		times = 0;
 #ifdef SYS_utimensat_time64
+	static int use_utimensat_time64 = 1;
 	r = -ENOSYS;
 	time_t s0=0, s1=0;
 	long ns0=0, ns1=0;
@@ -22,11 +23,14 @@ int utimensat(int fd, const char *path, const struct timespec times[2], int flag
 		if (!NS_SPECIAL(ns0)) s0 = times[0].tv_sec;
 		if (!NS_SPECIAL(ns1)) s1 = times[1].tv_sec;
 	}
-	if (SYS_utimensat == SYS_utimensat_time64 || !IS32BIT(s0) || !IS32BIT(s1))
-		r = __syscall(SYS_utimensat_time64, fd, path, times ?
-			((long long[]){s0, ns0, s1, ns1}) : 0, flags);
-	if (SYS_utimensat == SYS_utimensat_time64 || r!=-ENOSYS)
-		return __syscall_ret(r);
+	if (use_utimensat_time64) {
+		if (SYS_utimensat == SYS_utimensat_time64 || !IS32BIT(s0) || !IS32BIT(s1))
+			r = __syscall(SYS_utimensat_time64, fd, path, times ?
+				((long long[]){s0, ns0, s1, ns1}) : 0, flags);
+		if (SYS_utimensat == SYS_utimensat_time64 || r!=-ENOSYS)
+			return __syscall_ret(r);
+		else use_utimensat_time64 = 0;
+	}
 	if (!IS32BIT(s0) || !IS32BIT(s1))
 		return __syscall_ret(-ENOTSUP);
 	r = __syscall(SYS_utimensat, fd, path,
diff --git a/src/thread/__timedwait.c b/src/thread/__timedwait.c
index 666093be..2c417837 100644
--- a/src/thread/__timedwait.c
+++ b/src/thread/__timedwait.c
@@ -12,13 +12,17 @@ static int __futex4_cp(volatile void *addr, int op, int val, const struct timesp
 {
 	int r;
 #ifdef SYS_futex_time64
+	static int use_futex_time64 = 1;
 	time_t s = to ? to->tv_sec : 0;
 	long ns = to ? to->tv_nsec : 0;
 	r = -ENOSYS;
-	if (SYS_futex == SYS_futex_time64 || !IS32BIT(s))
-		r = __syscall_cp(SYS_futex_time64, addr, op, val,
-			to ? ((long long[]){s, ns}) : 0);
-	if (SYS_futex == SYS_futex_time64 || r!=-ENOSYS) return r;
+	if (use_futex_time64) {
+		if (SYS_futex == SYS_futex_time64 || !IS32BIT(s))
+			r = __syscall_cp(SYS_futex_time64, addr, op, val,
+				to ? ((long long[]){s, ns}) : 0);
+		if (SYS_futex == SYS_futex_time64 || r!=-ENOSYS) return r;
+		else use_futex_time64 = 0;
+	}
 	to = to ? (void *)(long[]){CLAMP(s), ns} : 0;
 #endif
 	r = __syscall_cp(SYS_futex, addr, op, val, to);
diff --git a/src/thread/pthread_mutex_timedlock.c b/src/thread/pthread_mutex_timedlock.c
index 9279fc54..dcf15945 100644
--- a/src/thread/pthread_mutex_timedlock.c
+++ b/src/thread/pthread_mutex_timedlock.c
@@ -6,13 +6,17 @@
 static int __futex4(volatile void *addr, int op, int val, const struct timespec *to)
 {
 #ifdef SYS_futex_time64
+	static int use_futex_time64 = 1;
 	time_t s = to ? to->tv_sec : 0;
 	long ns = to ? to->tv_nsec : 0;
 	int r = -ENOSYS;
-	if (SYS_futex == SYS_futex_time64 || !IS32BIT(s))
-		r = __syscall(SYS_futex_time64, addr, op, val,
-			to ? ((long long[]){s, ns}) : 0);
-	if (SYS_futex == SYS_futex_time64 || r!=-ENOSYS) return r;
+	if (use_futex_time64) {
+		if (SYS_futex == SYS_futex_time64 || !IS32BIT(s))
+			r = __syscall(SYS_futex_time64, addr, op, val,
+				to ? ((long long[]){s, ns}) : 0);
+		if (SYS_futex == SYS_futex_time64 || r!=-ENOSYS) return r;
+		else use_futex_time64 = 0;
+	}
 	to = to ? (void *)(long[]){CLAMP(s), ns} : 0;
 #endif
 	return __syscall(SYS_futex, addr, op, val, to);
diff --git a/src/time/clock_gettime.c b/src/time/clock_gettime.c
index 3e1d0975..736fde8b 100644
--- a/src/time/clock_gettime.c
+++ b/src/time/clock_gettime.c
@@ -73,11 +73,15 @@ int __clock_gettime(clockid_t clk, struct timespec *ts)
 #endif
 
 #ifdef SYS_clock_gettime64
+	static int use_clock_gettime64 = 1;
 	r = -ENOSYS;
-	if (sizeof(time_t) > 4)
-		r = __syscall(SYS_clock_gettime64, clk, ts);
-	if (SYS_clock_gettime == SYS_clock_gettime64 || r!=-ENOSYS)
-		return __syscall_ret(r);
+	if (use_clock_gettime64) {
+		if (sizeof(time_t) > 4)
+			r = __syscall(SYS_clock_gettime64, clk, ts);
+		if (SYS_clock_gettime == SYS_clock_gettime64 || r!=-ENOSYS)
+			return __syscall_ret(r);
+		else use_clock_gettime64 = 0;
+	}
 	long ts32[2];
 	r = __syscall(SYS_clock_gettime, clk, ts32);
 	if (r==-ENOSYS && clk==CLOCK_REALTIME) {
diff --git a/src/time/clock_nanosleep.c b/src/time/clock_nanosleep.c
index e195499c..bbb657b6 100644
--- a/src/time/clock_nanosleep.c
+++ b/src/time/clock_nanosleep.c
@@ -9,14 +9,18 @@ int __clock_nanosleep(clockid_t clk, int flags, const struct timespec *req, stru
 {
 	if (clk == CLOCK_THREAD_CPUTIME_ID) return EINVAL;
 #ifdef SYS_clock_nanosleep_time64
+	static int use_clock_nanosleep_time64 = 1;
 	time_t s = req->tv_sec;
 	long ns = req->tv_nsec;
 	int r = -ENOSYS;
-	if (SYS_clock_nanosleep == SYS_clock_nanosleep_time64 || !IS32BIT(s))
-		r = __syscall_cp(SYS_clock_nanosleep_time64, clk, flags,
-			((long long[]){s, ns}), rem);
-	if (SYS_clock_nanosleep == SYS_clock_nanosleep_time64 || r!=-ENOSYS)
-		return -r;
+	if (use_clock_nanosleep_time64) {
+		if (SYS_clock_nanosleep == SYS_clock_nanosleep_time64 || !IS32BIT(s))
+			r = __syscall_cp(SYS_clock_nanosleep_time64, clk, flags,
+				((long long[]){s, ns}), rem);
+		if (SYS_clock_nanosleep == SYS_clock_nanosleep_time64 || r!=-ENOSYS)
+			return -r;
+		else use_clock_nanosleep_time64 = 0;
+	}
 	long long extra = s - CLAMP(s);
 	long ts32[2] = { CLAMP(s), ns };
 	if (clk == CLOCK_REALTIME && !flags)
diff --git a/src/time/clock_settime.c b/src/time/clock_settime.c
index 1004ed15..aa137fa7 100644
--- a/src/time/clock_settime.c
+++ b/src/time/clock_settime.c
@@ -7,14 +7,18 @@
 int clock_settime(clockid_t clk, const struct timespec *ts)
 {
 #ifdef SYS_clock_settime64
+	static int use_clock_settime64 = 1;
 	time_t s = ts->tv_sec;
 	long ns = ts->tv_nsec;
 	int r = -ENOSYS;
-	if (SYS_clock_settime == SYS_clock_settime64 || !IS32BIT(s))
-		r = __syscall(SYS_clock_settime64, clk,
-			((long long[]){s, ns}));
-	if (SYS_clock_settime == SYS_clock_settime64 || r!=-ENOSYS)
-		return __syscall_ret(r);
+	if (use_clock_settime64) {
+		if (SYS_clock_settime == SYS_clock_settime64 || !IS32BIT(s))
+			r = __syscall(SYS_clock_settime64, clk,
+				((long long[]){s, ns}));
+		if (SYS_clock_settime == SYS_clock_settime64 || r!=-ENOSYS)
+			return __syscall_ret(r);
+		else use_clock_settime64 = 0;
+	}
 	if (!IS32BIT(s))
 		return __syscall_ret(-ENOTSUP);
 	return syscall(SYS_clock_settime, clk, ((long[]){s, ns}));
-- 
2.20.1



* Re: [musl] Backwards kernel compatibility
  2021-06-08 22:16       ` Martin Vajnar
@ 2021-06-09  0:37         ` Rich Felker
  0 siblings, 0 replies; 13+ messages in thread
From: Rich Felker @ 2021-06-09  0:37 UTC (permalink / raw)
  To: musl; +Cc: Martin Vajnar

On Wed, Jun 09, 2021 at 12:16:40AM +0200, Martin Vajnar wrote:
> Hi, Rich,
> 
> On Tue, May 25, 2021 at 0:00, Rich Felker <dalias@libc.org> wrote:
> >
> > On Mon, May 24, 2021 at 03:52:44PM +0200, Martin Vajnar wrote:
> > > Hi, Markus,
> > >
> > > sorry for the late reply; it has been quite busy lately. You're describing
> > > exactly the issue we are facing in our project. We need to use an old kernel,
> > > which we have only in binary form together with its headers. At the same time
> > > we would like to have the latest musl running on it.
> > >
> > > The problem we encounter is that for unsupported (or better said, not
> > > supported yet) syscalls we get performance overhead because of the ENOSYS.
> >
> > Can you give some information on what syscalls these are and if/how
> > you measured the performance overhead as being significant?
> >
> > > We see 2 options to approach this:
> > >
> > >  1. remove the syscalls manually/alter the code to not invoke them (hacky)
> > >  2. during musl compile time (maybe even configure-time), parse the
> > > supplied kernel headers and based on availability of syscalls use defines
> > > to steer the code execution (more universal)
> > >
> > > Would the 2nd case be something that musl community would be interested in,
> > > should we choose to implement it for the project?
> >
> > No, but hopefully there's a third option: identify whatever place the
> > fallback is actually a performance bottleneck and do what we can to
> > mitigate it. If it's really bad, saving the result might be an option,
> > but we've tried to avoid that both for complexity reasons and because
> > it could preclude fixing serious problems (like Y2038 EOL) by
> > live-migrating processes to a newer kernel with new syscalls that
> > avoid the bug. A better approach is just using the "oldest" syscall
> > that can actually do the job, which we already try to do in most
> > places in musl, only relying on the newer one for inputs that require
> > it. However this is not possible for functions that read back a time,
> > since the input is external (e.g. the system clock or the filesystem)
> > and it's not known in advance whether the old syscall could represent
> > the result.
> >
> > It *might* be plausible to memorize the result "new syscall not
> > available" but drop that memory whenever we see a result that
> > indicates a failure due to use of the outdated syscall. We're kinda
> > already doing that with the vdso clock_gettime -- cgt_time32_wrap
> > disables itself if it ever sees a negative value for seconds.
> 
> Since updating the kernel is not an option for me, I prepared a patch
> that remembers a failed attempt at a time64 syscall variant or at the
> statx syscall, so that the next call skips it and uses the fallback
> directly. I'm attaching the patch in case someone else is solving the
> same issue. Please let me know whether this approach would be of
> interest for upstreaming and, if so, whether there are any changes I
> should make.

It's not acceptable as written for upstream. All of the memoization
flags are data races (plain static ints read and written with no
synchronization), and as noted before, it breaks the use case of live
migration (CRIU or similar) from a 2038-EOL'd kernel to one with a
future.
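
For anyone carrying this locally, the data-race half of the objection
could probably be addressed with a relaxed atomic flag. The sketch below
uses C11 <stdatomic.h> only to stay self-contained (musl has its own
internal atomic primitives), and try_new, try_old, call_with_fallback
and new_available are hypothetical stand-ins that simulate -ENOSYS
rather than making a real syscall. It does nothing for the
live-migration half: once the flag is cleared it stays cleared.

```c
#include <errno.h>
#include <stdatomic.h>

/* Hypothetical stand-ins for the time64 and legacy syscalls.
 * new_available simulates whether the kernel implements the new one. */
static int new_available = 0;
static int new_attempts = 0;

static int try_new(void)
{
	new_attempts++;
	return new_available ? 64 : -ENOSYS;
}

static int try_old(void)
{
	return 32;
}

/* 1 = new syscall not yet known to be missing. Relaxed ordering is
 * enough here: the flag guards no other data, and a stale read only
 * costs one extra failed attempt. */
static atomic_int use_new = 1;

int call_with_fallback(void)
{
	if (atomic_load_explicit(&use_new, memory_order_relaxed)) {
		int r = try_new();
		if (r != -ENOSYS) return r;
		atomic_store_explicit(&use_new, 0, memory_order_relaxed);
	}
	return try_old();
}
```

Setting new_available to 1 after the first failed call and calling
again still takes the old path, which is the migration problem in
miniature.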

If you want to maintain a patch to apply locally, I think it would
make sense to do away with trying to keep indentation "right" so that
the patches are minimal, even if you end up using goto. Also I think
the patch is making some changes you don't need. See below:
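
Before the per-file comments, a rough sketch of what the goto form could
look like. The original lines keep their indentation and the hunk only
adds a flag test, a flag store and a label. All names here
(fake_new_syscall, fake_old_syscall, wrapper, skip_new) are hypothetical,
and the flag is still a plain static int, so the data race noted above
applies to this form as well.

```c
#include <errno.h>

/* Hypothetical stand-ins for the real syscalls; new_calls counts attempts. */
static int new_calls = 0;
static int fake_new_syscall(void) { new_calls++; return -ENOSYS; }
static int fake_old_syscall(void) { return 0; }

int wrapper(void)
{
	static int skip_new;		/* added line; plain int, so still racy */
	int r = -ENOSYS;
	if (skip_new) goto fallback;	/* added line */
	r = fake_new_syscall();		/* original line, not reindented */
	if (r != -ENOSYS)		/* original line, not reindented */
		return r;
	skip_new = 1;			/* added line */
fallback:				/* added line */
	return fake_old_syscall();
}
```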


> From ceb2371ada673d258328a019691bff014ea3abca Mon Sep 17 00:00:00 2001
> From: Martin Vajnar <martin.vajnar@gmail.com>
> Date: Tue, 8 Jun 2021 23:36:11 +0200
> Subject: [PATCH] store fallback paths for kernels without time64/statx
>  syscalls
> 
> ---
>  src/ipc/semtimedop.c                 |  12 ++--
>  src/linux/clock_adjtime.c            | 100 ++++++++++++++-------------
>  src/linux/ppoll.c                    |  16 +++--
>  src/linux/timerfd.c                  |  28 +++++---
>  src/misc/getrusage.c                 |  28 ++++----
>  src/mq/mq_timedreceive.c             |  14 ++--
>  src/mq/mq_timedsend.c                |  14 ++--
>  src/network/recvmmsg.c               |  13 ++--
>  src/select/pselect.c                 |  14 ++--
>  src/select/select.c                  |  16 +++--
>  src/signal/sigtimedwait.c            |  14 ++--
>  src/stat/fstatat.c                   |  10 ++-
>  src/stat/utimensat.c                 |  14 ++--
>  src/thread/__timedwait.c             |  12 ++--
>  src/thread/pthread_mutex_timedlock.c |  12 ++--
>  src/time/clock_gettime.c             |  12 ++--
>  src/time/clock_nanosleep.c           |  14 ++--
>  src/time/clock_settime.c             |  14 ++--
>  18 files changed, 217 insertions(+), 140 deletions(-)
> 
> diff --git a/src/ipc/semtimedop.c b/src/ipc/semtimedop.c
> index 1632e7b0..d9fe404d 100644
> --- a/src/ipc/semtimedop.c
> +++ b/src/ipc/semtimedop.c
> @@ -16,13 +16,17 @@
>  int semtimedop(int id, struct sembuf *buf, size_t n, const struct timespec *ts)
>  {
>  #ifdef SYS_semtimedop_time64
> +	static int use_semtimedop_time64 = 1;
>  	time_t s = ts ? ts->tv_sec : 0;
>  	long ns = ts ? ts->tv_nsec : 0;
>  	int r = -ENOSYS;
> -	if (NO_TIME32 || !IS32BIT(s))
> -		r = __syscall(SYS_semtimedop_time64, id, buf, n,
> -			ts ? ((long long[]){s, ns}) : 0);
> -	if (NO_TIME32 || r!=-ENOSYS) return __syscall_ret(r);
> +	if (use_semtimedop_time64) {
> +		if (NO_TIME32 || !IS32BIT(s))
> +			r = __syscall(SYS_semtimedop_time64, id, buf, n,
> +				ts ? ((long long[]){s, ns}) : 0);
> +		if (NO_TIME32 || r!=-ENOSYS) return __syscall_ret(r);
> +		else use_semtimedop_time64 = 0;
> +	}
>  	ts = ts ? (void *)(long[]){CLAMP(s), ns} : 0;
>  #endif
>  #if defined(SYS_ipc)

No change needed here. The time64 syscall is only made if ts does not
fit in 32 bits.
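
To illustrate which inputs would reach the time64 path, here is a small
sketch. fits_time32 and needs_time64 are hypothetical helpers written
for this example, not musl's IS32BIT macro; they only model the range
check on the seconds field.

```c
#include <stdint.h>

/* Hypothetical helper: nonzero if a seconds value is representable in a
 * signed 32-bit time_t, i.e. the legacy syscall can carry it. */
int fits_time32(long long sec)
{
	return sec >= INT32_MIN && sec <= INT32_MAX;
}

/* Nonzero if the time64 syscall would have to be attempted for a
 * timeout whose seconds field is sec. has_timeout is 0 for a null
 * timespec pointer, which is treated as seconds == 0. */
int needs_time64(int has_timeout, long long sec)
{
	return has_timeout && !fits_time32(sec);
}
```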

> diff --git a/src/linux/ppoll.c b/src/linux/ppoll.c
> index e614600a..939f3f1d 100644
> --- a/src/linux/ppoll.c
> +++ b/src/linux/ppoll.c
> @@ -12,13 +12,17 @@ int ppoll(struct pollfd *fds, nfds_t n, const struct timespec *to, const sigset_
>  	time_t s = to ? to->tv_sec : 0;
>  	long ns = to ? to->tv_nsec : 0;
>  #ifdef SYS_ppoll_time64
> +	static int use_ppoll_time64 = 1;
>  	int r = -ENOSYS;
> -	if (SYS_ppoll == SYS_ppoll_time64 || !IS32BIT(s))
> -		r = __syscall_cp(SYS_ppoll_time64, fds, n,
> -			to ? ((long long[]){s, ns}) : 0,
> -			mask, _NSIG/8);
> -	if (SYS_ppoll == SYS_ppoll_time64 || r != -ENOSYS)
> -		return __syscall_ret(r);
> +	if (use_ppoll_time64) {
> +		if (SYS_ppoll == SYS_ppoll_time64 || !IS32BIT(s))
> +			r = __syscall_cp(SYS_ppoll_time64, fds, n,
> +				to ? ((long long[]){s, ns}) : 0,
> +				mask, _NSIG/8);
> +		if (SYS_ppoll == SYS_ppoll_time64 || r != -ENOSYS)
> +			return __syscall_ret(r);
> +		else use_ppoll_time64 = 0;
> +	}
>  	s = CLAMP(s);
>  #endif
>  	return syscall_cp(SYS_ppoll, fds, n,

Same here. As this is a relative timeout, the time64 syscall will
essentially _never_ be used, unless a program spuriously uses a finite
more-than-68-year timeout in place of "forever".
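
The 68-year figure is just the range of a signed 32-bit seconds count.
A quick check, assuming a 365.25-day year (the macro and function names
are made up for this example):

```c
#include <stdint.h>

/* Seconds in a 365.25-day year (an approximation chosen for this sketch). */
#define SECONDS_PER_YEAR (36525LL * 86400LL / 100LL)

/* Whole years that fit in a signed 32-bit relative timeout. */
long long max_time32_timeout_years(void)
{
	return INT32_MAX / SECONDS_PER_YEAR;
}
```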

> diff --git a/src/linux/timerfd.c b/src/linux/timerfd.c
> index 5bdfaf16..65bda29b 100644
> --- a/src/linux/timerfd.c
> +++ b/src/linux/timerfd.c
> @@ -14,13 +14,17 @@ int timerfd_settime(int fd, int flags, const struct itimerspec *new, struct itim
>  #ifdef SYS_timerfd_settime64
>  	time_t is = new->it_interval.tv_sec, vs = new->it_value.tv_sec;
>  	long ins = new->it_interval.tv_nsec, vns = new->it_value.tv_nsec;
> +	static int use_timerfd_settime64 = 1;
>  	int r = -ENOSYS;
> -	if (SYS_timerfd_settime == SYS_timerfd_settime64
> -	    || !IS32BIT(is) || !IS32BIT(vs) || (sizeof(time_t)>4 && old))
> -		r = __syscall(SYS_timerfd_settime64, fd, flags,
> -			((long long[]){is, ins, vs, vns}), old);
> -	if (SYS_timerfd_settime == SYS_timerfd_settime64 || r!=-ENOSYS)
> -		return __syscall_ret(r);
> +	if (use_timerfd_settime64) {
> +		if (SYS_timerfd_settime == SYS_timerfd_settime64
> +			|| !IS32BIT(is) || !IS32BIT(vs) || (sizeof(time_t)>4 && old))
> +			r = __syscall(SYS_timerfd_settime64, fd, flags,
> +				((long long[]){is, ins, vs, vns}), old);
> +		if (SYS_timerfd_settime == SYS_timerfd_settime64 || r!=-ENOSYS)
> +			return __syscall_ret(r);
> +		else use_timerfd_settime64 = 0;
> +	}
>  	if (!IS32BIT(is) || !IS32BIT(vs))
>  		return __syscall_ret(-ENOTSUP);
>  	long old32[4];

Same.

> @@ -40,11 +44,15 @@ int timerfd_settime(int fd, int flags, const struct itimerspec *new, struct itim
>  int timerfd_gettime(int fd, struct itimerspec *cur)
>  {
>  #ifdef SYS_timerfd_gettime64
> +	static int use_timerfd_gettime64 = 1;
>  	int r = -ENOSYS;
> -	if (sizeof(time_t) > 4)
> -		r = __syscall(SYS_timerfd_gettime64, fd, cur);
> -	if (SYS_timerfd_gettime == SYS_timerfd_gettime64 || r!=-ENOSYS)
> -		return __syscall_ret(r);
> +	if (use_timerfd_gettime64) {
> +		if (sizeof(time_t) > 4)
> +			r = __syscall(SYS_timerfd_gettime64, fd, cur);
> +		if (SYS_timerfd_gettime == SYS_timerfd_gettime64 || r!=-ENOSYS)
> +			return __syscall_ret(r);
> +		else use_timerfd_gettime64 = 0;
> +	}
>  	long cur32[4];
>  	r = __syscall(SYS_timerfd_gettime, fd, cur32);
>  	if (!r) {

This one is actually needed for your purposes, since it's querying a
time *from* the kernel and there's no a priori way to know if it will
fit in 32 bits.
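
The difference between the two kinds of wrapper can be sketched by
counting attempts on a simulated old kernel. Everything below is
hypothetical (set_like, get_like and the fake syscalls are stand-ins,
and the fake time64 call always returns -ENOSYS); it only models the
control flow seen in the hunks above.

```c
#include <errno.h>
#include <stdint.h>

static int time64_attempts = 0;

/* Simulated old kernel: the time64 syscall is never implemented. */
static int fake_time64_syscall(void) { time64_attempts++; return -ENOSYS; }
static int fake_time32_syscall(void) { return 0; }

/* Input-driven wrapper (settime-like): the caller supplies sec, so the
 * time64 syscall is attempted only when sec does not fit in 32 bits. */
int set_like(long long sec)
{
	int big = sec < INT32_MIN || sec > INT32_MAX;
	int r = -ENOSYS;
	if (big) r = fake_time64_syscall();
	if (r != -ENOSYS) return r;
	if (big) return -ENOTSUP;
	return fake_time32_syscall();
}

/* Output-driven wrapper (gettime-like): the value comes from the
 * kernel, so the time64 syscall is tried first on every call. */
int get_like(void)
{
	int r = fake_time64_syscall();
	if (r != -ENOSYS) return r;
	return fake_time32_syscall();
}
```

With ordinary inputs set_like never pays for a failed syscall, while
get_like pays for one on every call, which is why only the second kind
gains anything from remembering the failure.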

> diff --git a/src/misc/getrusage.c b/src/misc/getrusage.c
> index 8e03e2e3..47d87981 100644
> --- a/src/misc/getrusage.c
> +++ b/src/misc/getrusage.c
> @@ -7,19 +7,23 @@ int getrusage(int who, struct rusage *ru)
>  {
>  	int r;
>  #ifdef SYS_getrusage_time64
> -	long long kru64[18];
> -	r = __syscall(SYS_getrusage_time64, who, kru64);
> -	if (!r) {
> -		ru->ru_utime = (struct timeval)
> -			{ .tv_sec = kru64[0], .tv_usec = kru64[1] };
> -		ru->ru_stime = (struct timeval)
> -			{ .tv_sec = kru64[2], .tv_usec = kru64[3] };
> -		char *slots = (char *)&ru->ru_maxrss;
> -		for (int i=0; i<14; i++)
> -			*(long *)(slots + i*sizeof(long)) = kru64[4+i];
> +	static int use_getrusage_time64 = 1;
> +	if (use_getrusage_time64) {
> +		long long kru64[18];
> +		r = __syscall(SYS_getrusage_time64, who, kru64);
> +		if (!r) {
> +			ru->ru_utime = (struct timeval)
> +				{ .tv_sec = kru64[0], .tv_usec = kru64[1] };
> +			ru->ru_stime = (struct timeval)
> +				{ .tv_sec = kru64[2], .tv_usec = kru64[3] };
> +			char *slots = (char *)&ru->ru_maxrss;
> +			for (int i=0; i<14; i++)
> +				*(long *)(slots + i*sizeof(long)) = kru64[4+i];
> +		}
> +		if (SYS_getrusage_time64 == SYS_getrusage || r != -ENOSYS)
> +			return __syscall_ret(r);
> +		else use_getrusage_time64 = 0;
>  	}
> -	if (SYS_getrusage_time64 == SYS_getrusage || r != -ENOSYS)
> -		return __syscall_ret(r);
>  #endif
>  	char *dest = (char *)&ru->ru_maxrss - 4*sizeof(long);
>  	r = __syscall(SYS_getrusage, who, dest);

I don't think this time64 syscall even exists on archs with a time32
version, but I may be misremembering.

> diff --git a/src/mq/mq_timedreceive.c b/src/mq/mq_timedreceive.c
> index f41b6642..43c4fef6 100644
> --- a/src/mq/mq_timedreceive.c
> +++ b/src/mq/mq_timedreceive.c
> @@ -8,14 +8,18 @@
>  ssize_t mq_timedreceive(mqd_t mqd, char *restrict msg, size_t len, unsigned *restrict prio, const struct timespec *restrict at)
>  {
>  #ifdef SYS_mq_timedreceive_time64
> +	static int use_mq_timedreceive_time64 = 1;
>  	time_t s = at ? at->tv_sec : 0;
>  	long ns = at ? at->tv_nsec : 0;
>  	long r = -ENOSYS;
> -	if (SYS_mq_timedreceive == SYS_mq_timedreceive_time64 || !IS32BIT(s))
> -		r = __syscall_cp(SYS_mq_timedreceive_time64, mqd, msg, len, prio,
> -			at ? ((long long []){at->tv_sec, at->tv_nsec}) : 0);
> -	if (SYS_mq_timedreceive == SYS_mq_timedreceive_time64 || r != -ENOSYS)
> -		return __syscall_ret(r);
> +	if (use_mq_timedreceive_time64) {
> +		if (SYS_mq_timedreceive == SYS_mq_timedreceive_time64 || !IS32BIT(s))
> +			r = __syscall_cp(SYS_mq_timedreceive_time64, mqd, msg, len, prio,
> +				at ? ((long long []){at->tv_sec, at->tv_nsec}) : 0);
> +		if (SYS_mq_timedreceive == SYS_mq_timedreceive_time64 || r != -ENOSYS)
> +			return __syscall_ret(r);
> +		else use_mq_timedreceive_time64 = 0;
> +	}
>  	return syscall_cp(SYS_mq_timedreceive, mqd, msg, len, prio,
>  		at ? ((long[]){CLAMP(s), ns}) : 0);
>  #else

Also not needed, for above reason.

> diff --git a/src/mq/mq_timedsend.c b/src/mq/mq_timedsend.c
> index 56cfcbb8..77164093 100644
> --- a/src/mq/mq_timedsend.c
> +++ b/src/mq/mq_timedsend.c
> @@ -8,14 +8,18 @@
>  int mq_timedsend(mqd_t mqd, const char *msg, size_t len, unsigned prio, const struct timespec *at)
>  {
>  #ifdef SYS_mq_timedsend_time64
> +	static int use_mq_timedsend_time64 = 1;
>  	time_t s = at ? at->tv_sec : 0;
>  	long ns = at ? at->tv_nsec : 0;
>  	long r = -ENOSYS;
> -	if (SYS_mq_timedsend == SYS_mq_timedsend_time64 || !IS32BIT(s))
> -		r = __syscall_cp(SYS_mq_timedsend_time64, mqd, msg, len, prio,
> -			at ? ((long long []){at->tv_sec, at->tv_nsec}) : 0);
> -	if (SYS_mq_timedsend == SYS_mq_timedsend_time64 || r != -ENOSYS)
> -		return __syscall_ret(r);
> +	if (use_mq_timedsend_time64) {
> +		if (SYS_mq_timedsend == SYS_mq_timedsend_time64 || !IS32BIT(s))
> +			r = __syscall_cp(SYS_mq_timedsend_time64, mqd, msg, len, prio,
> +				at ? ((long long []){at->tv_sec, at->tv_nsec}) : 0);
> +		if (SYS_mq_timedsend == SYS_mq_timedsend_time64 || r != -ENOSYS)
> +			return __syscall_ret(r);
> +		else use_mq_timedsend_time64 = 0;
> +	}
>  	return syscall_cp(SYS_mq_timedsend, mqd, msg, len, prio,
>  		at ? ((long[]){CLAMP(s), ns}) : 0);
>  #else

Same.

> diff --git a/src/network/recvmmsg.c b/src/network/recvmmsg.c
> index 2978e2f6..ea603854 100644
> --- a/src/network/recvmmsg.c
> +++ b/src/network/recvmmsg.c
> @@ -19,12 +19,17 @@ int recvmmsg(int fd, struct mmsghdr *msgvec, unsigned int vlen, unsigned int fla
>  		mh->msg_hdr.__pad1 = mh->msg_hdr.__pad2 = 0;
>  #endif
>  #ifdef SYS_recvmmsg_time64
> +	static int use_recvmmsg_time64 = 1;
>  	time_t s = timeout ? timeout->tv_sec : 0;
>  	long ns = timeout ? timeout->tv_nsec : 0;
> -	int r = __syscall_cp(SYS_recvmmsg_time64, fd, msgvec, vlen, flags,
> -			timeout ? ((long long[]){s, ns}) : 0);
> -	if (SYS_recvmmsg == SYS_recvmmsg_time64 || r!=-ENOSYS)
> -		return __syscall_ret(r);
> +	int r;
> +	if (use_recvmmsg_time64) {
> +		r = __syscall_cp(SYS_recvmmsg_time64, fd, msgvec, vlen, flags,
> +				timeout ? ((long long[]){s, ns}) : 0);
> +		if (SYS_recvmmsg == SYS_recvmmsg_time64 || r!=-ENOSYS)
> +			return __syscall_ret(r);
> +		else use_recvmmsg_time64 = 0;
> +	}
>  	if (vlen > IOV_MAX) vlen = IOV_MAX;
>  	socklen_t csize[vlen];
>  	for (int i=0; i<vlen; i++) csize[i] = msgvec[i].msg_hdr.msg_controllen;

I don't recall why this one doesn't use the old syscall when the time
fits, so as written your patch seems to be needed here.

> diff --git a/src/select/pselect.c b/src/select/pselect.c
> index 54cfb291..ef221f14 100644
> --- a/src/select/pselect.c
> +++ b/src/select/pselect.c
> @@ -13,12 +13,16 @@ int pselect(int n, fd_set *restrict rfds, fd_set *restrict wfds, fd_set *restric
>  	time_t s = ts ? ts->tv_sec : 0;
>  	long ns = ts ? ts->tv_nsec : 0;
>  #ifdef SYS_pselect6_time64
> +	static int use_pselect6_time64 = 1;
>  	int r = -ENOSYS;
> -	if (SYS_pselect6 == SYS_pselect6_time64 || !IS32BIT(s))
> -		r = __syscall_cp(SYS_pselect6_time64, n, rfds, wfds, efds,
> -			ts ? ((long long[]){s, ns}) : 0, data);
> -	if (SYS_pselect6 == SYS_pselect6_time64 || r!=-ENOSYS)
> -		return __syscall_ret(r);
> +	if (use_pselect6_time64) {
> +		if (SYS_pselect6 == SYS_pselect6_time64 || !IS32BIT(s))
> +			r = __syscall_cp(SYS_pselect6_time64, n, rfds, wfds, efds,
> +				ts ? ((long long[]){s, ns}) : 0, data);
> +		if (SYS_pselect6 == SYS_pselect6_time64 || r!=-ENOSYS)
> +			return __syscall_ret(r);
> +		else use_pselect6_time64 = 0;
> +	}
>  	s = CLAMP(s);
>  #endif
>  	return syscall_cp(SYS_pselect6, n, rfds, wfds, efds,

This shouldn't be needed, for the same reason as above.
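
As I read the quoted hunks, the distinction is this: wrappers that pass
a time *into* the kernel only attempt the time64 syscall when the
seconds value does not fit in 32 bits, so on an old kernel the failing
call is never made for ordinary values and there is nothing to cache.
A minimal sketch of that gating follows. It is not musl code:
fake_settime64, fake_settime32, fits_32bit and sketch_settime are
hypothetical names, and the "kernel" is simulated with counters so the
call pattern can be observed.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel entry points. The counters
 * exist only so the call pattern can be inspected. */
static int time64_calls, time32_calls;

static int fits_32bit(long long sec)
{
	return sec >= INT32_MIN && sec <= INT32_MAX;
}

/* Simulated old kernel: the 64-bit call is not implemented. */
static long fake_settime64(long long sec)
{
	(void)sec;
	time64_calls++;
	return -ENOSYS;
}

static long fake_settime32(long long sec)
{
	/* The 32-bit call must never see an out-of-range value. */
	assert(fits_32bit(sec));
	time32_calls++;
	return 0;
}

/* Same shape as the "time passed in" wrappers in the quoted patch:
 * the time64 call is attempted only when the value needs it. */
long sketch_settime(long long sec)
{
	long r = -ENOSYS;
	if (!fits_32bit(sec))
		r = fake_settime64(sec);
	if (r != -ENOSYS)
		return r;
	if (!fits_32bit(sec))
		return -ENOTSUP;
	return fake_settime32(sec);
}
```

With a value that fits, sketch_settime goes straight to the 32-bit
call; only an out-of-range value ever reaches the unimplemented one.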

> diff --git a/src/select/select.c b/src/select/select.c
> index 8a786884..810cf450 100644
> --- a/src/select/select.c
> +++ b/src/select/select.c
> @@ -26,13 +26,17 @@ int select(int n, fd_set *restrict rfds, fd_set *restrict wfds, fd_set *restrict
>  	}
>  
>  #ifdef SYS_pselect6_time64
> +	static int use_pselect6_time64 = 1;
>  	int r = -ENOSYS;
> -	if (SYS_pselect6 == SYS_pselect6_time64 || !IS32BIT(s))
> -		r = __syscall_cp(SYS_pselect6_time64, n, rfds, wfds, efds,
> -			tv ? ((long long[]){s, ns}) : 0,
> -			((syscall_arg_t[]){ 0, _NSIG/8 }));
> -	if (SYS_pselect6 == SYS_pselect6_time64 || r!=-ENOSYS)
> -		return __syscall_ret(r);
> +	if (use_pselect6_time64) {
> +		if (SYS_pselect6 == SYS_pselect6_time64 || !IS32BIT(s))
> +			r = __syscall_cp(SYS_pselect6_time64, n, rfds, wfds, efds,
> +				tv ? ((long long[]){s, ns}) : 0,
> +				((syscall_arg_t[]){ 0, _NSIG/8 }));
> +		if (SYS_pselect6 == SYS_pselect6_time64 || r!=-ENOSYS)
> +			return __syscall_ret(r);
> +		else use_pselect6_time64 = 0;
> +	}
>  #endif
>  #ifdef SYS_select
>  	return syscall_cp(SYS_select, n, rfds, wfds, efds,

Same.

> diff --git a/src/signal/sigtimedwait.c b/src/signal/sigtimedwait.c
> index 1287174e..efc786db 100644
> --- a/src/signal/sigtimedwait.c
> +++ b/src/signal/sigtimedwait.c
> @@ -8,14 +8,18 @@
>  static int do_sigtimedwait(const sigset_t *restrict mask, siginfo_t *restrict si, const struct timespec *restrict ts)
>  {
>  #ifdef SYS_rt_sigtimedwait_time64
> +	static int use_rt_sigtimedwait_time64 = 1;
>  	time_t s = ts ? ts->tv_sec : 0;
>  	long ns = ts ? ts->tv_nsec : 0;
>  	int r = -ENOSYS;
> -	if (SYS_rt_sigtimedwait == SYS_rt_sigtimedwait_time64 || !IS32BIT(s))
> -		r = __syscall_cp(SYS_rt_sigtimedwait_time64, mask, si,
> -			ts ? ((long long[]){s, ns}) : 0, _NSIG/8);
> -	if (SYS_rt_sigtimedwait == SYS_rt_sigtimedwait_time64 || r!=-ENOSYS)
> -		return r;
> +	if (use_rt_sigtimedwait_time64) {
> +		if (SYS_rt_sigtimedwait == SYS_rt_sigtimedwait_time64 || !IS32BIT(s))
> +			r = __syscall_cp(SYS_rt_sigtimedwait_time64, mask, si,
> +				ts ? ((long long[]){s, ns}) : 0, _NSIG/8);
> +		if (SYS_rt_sigtimedwait == SYS_rt_sigtimedwait_time64 || r!=-ENOSYS)
> +			return r;
> +		else use_rt_sigtimedwait_time64 = 0;
> +	}
>  	return __syscall_cp(SYS_rt_sigtimedwait, mask, si,
>  		ts ? ((long[]){CLAMP(s), ns}) : 0, _NSIG/8);;
>  #else

Same.

> diff --git a/src/stat/fstatat.c b/src/stat/fstatat.c
> index de165b5c..e8e8c866 100644
> --- a/src/stat/fstatat.c
> +++ b/src/stat/fstatat.c
> @@ -133,10 +133,14 @@ static int fstatat_kstat(int fd, const char *restrict path, struct stat *restric
>  
>  int fstatat(int fd, const char *restrict path, struct stat *restrict st, int flag)
>  {
> +	static int use_statx = 1;
>  	int ret;
> -	if (sizeof((struct kstat){0}.st_atime_sec) < sizeof(time_t)) {
> -		ret = fstatat_statx(fd, path, st, flag);
> -		if (ret!=-ENOSYS) return __syscall_ret(ret);
> +	if (use_statx) {
> +		if (sizeof((struct kstat){0}.st_atime_sec) < sizeof(time_t)) {
> +			ret = fstatat_statx(fd, path, st, flag);
> +			if (ret!=-ENOSYS) return __syscall_ret(ret);
> +			else use_statx = 0;
> +		}
>  	}
>  	ret = fstatat_kstat(fd, path, st, flag);
>  	return __syscall_ret(ret);

Indeed this case is needed because it's a "reading back time" one.
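
For the "reading back time" cases the caller cannot know in advance
whether the result fits in 32 bits, so the new syscall has to be tried
unconditionally, and remembering an ENOSYS result is what saves the
repeated failing call. A minimal sketch of that pattern, under the same
assumptions as any simulation: fake_gettime64, fake_gettime32 and
sketch_gettime are hypothetical names, the returned timestamp is an
arbitrary constant, and the plain static flag simply mirrors the quoted
patch.

```c
#include <assert.h>
#include <errno.h>

/* Counters so the call pattern can be inspected. */
static int new_calls, old_calls;

/* Simulated old kernel: the 64-bit call is not implemented. */
static long fake_gettime64(long long *sec)
{
	(void)sec;
	new_calls++;
	return -ENOSYS;
}

/* Simulated legacy call returning an arbitrary 32-bit timestamp. */
static long fake_gettime32(long *sec)
{
	old_calls++;
	*sec = 1623222000L;
	return 0;
}

long sketch_gettime(long long *sec)
{
	/* Cleared after the first ENOSYS, as in the quoted patch. */
	static int use_time64 = 1;
	long r;

	assert(sec != 0);
	if (use_time64) {
		r = fake_gettime64(sec);
		if (r != -ENOSYS)
			return r;
		use_time64 = 0;
	}
	long sec32;
	r = fake_gettime32(&sec32);
	if (!r)
		*sec = sec32; /* widen the legacy result */
	return r;
}
```

Calling sketch_gettime repeatedly makes the failing 64-bit call once
and then uses only the legacy path.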

> diff --git a/src/stat/utimensat.c b/src/stat/utimensat.c
> index 730723a9..adde7e4e 100644
> --- a/src/stat/utimensat.c
> +++ b/src/stat/utimensat.c
> @@ -13,6 +13,7 @@ int utimensat(int fd, const char *path, const struct timespec times[2], int flag
>  	if (times && times[0].tv_nsec==UTIME_NOW && times[1].tv_nsec==UTIME_NOW)
>  		times = 0;
>  #ifdef SYS_utimensat_time64
> +	static int use_utimensat_time64 = 1;
>  	r = -ENOSYS;
>  	time_t s0=0, s1=0;
>  	long ns0=0, ns1=0;
> @@ -22,11 +23,14 @@ int utimensat(int fd, const char *path, const struct timespec times[2], int flag
>  		if (!NS_SPECIAL(ns0)) s0 = times[0].tv_sec;
>  		if (!NS_SPECIAL(ns1)) s1 = times[1].tv_sec;
>  	}
> -	if (SYS_utimensat == SYS_utimensat_time64 || !IS32BIT(s0) || !IS32BIT(s1))
> -		r = __syscall(SYS_utimensat_time64, fd, path, times ?
> -			((long long[]){s0, ns0, s1, ns1}) : 0, flags);
> -	if (SYS_utimensat == SYS_utimensat_time64 || r!=-ENOSYS)
> -		return __syscall_ret(r);
> +	if (use_utimensat_time64) {
> +		if (SYS_utimensat == SYS_utimensat_time64 || !IS32BIT(s0) || !IS32BIT(s1))
> +			r = __syscall(SYS_utimensat_time64, fd, path, times ?
> +				((long long[]){s0, ns0, s1, ns1}) : 0, flags);
> +		if (SYS_utimensat == SYS_utimensat_time64 || r!=-ENOSYS)
> +			return __syscall_ret(r);
> +		else use_utimensat_time64 = 0;
> +	}
>  	if (!IS32BIT(s0) || !IS32BIT(s1))
>  		return __syscall_ret(-ENOTSUP);
>  	r = __syscall(SYS_utimensat, fd, path,

Not needed.

> diff --git a/src/thread/__timedwait.c b/src/thread/__timedwait.c
> index 666093be..2c417837 100644
> --- a/src/thread/__timedwait.c
> +++ b/src/thread/__timedwait.c
> @@ -12,13 +12,17 @@ static int __futex4_cp(volatile void *addr, int op, int val, const struct timesp
>  {
>  	int r;
>  #ifdef SYS_futex_time64
> +	static int use_futex_time64 = 1;
>  	time_t s = to ? to->tv_sec : 0;
>  	long ns = to ? to->tv_nsec : 0;
>  	r = -ENOSYS;
> -	if (SYS_futex == SYS_futex_time64 || !IS32BIT(s))
> -		r = __syscall_cp(SYS_futex_time64, addr, op, val,
> -			to ? ((long long[]){s, ns}) : 0);
> -	if (SYS_futex == SYS_futex_time64 || r!=-ENOSYS) return r;
> +	if (use_futex_time64) {
> +		if (SYS_futex == SYS_futex_time64 || !IS32BIT(s))
> +			r = __syscall_cp(SYS_futex_time64, addr, op, val,
> +				to ? ((long long[]){s, ns}) : 0);
> +		if (SYS_futex == SYS_futex_time64 || r!=-ENOSYS) return r;
> +		else use_futex_time64 = 0;
> +	}
>  	to = to ? (void *)(long[]){CLAMP(s), ns} : 0;
>  #endif
>  	r = __syscall_cp(SYS_futex, addr, op, val, to);

Not needed.

> diff --git a/src/thread/pthread_mutex_timedlock.c b/src/thread/pthread_mutex_timedlock.c
> index 9279fc54..dcf15945 100644
> --- a/src/thread/pthread_mutex_timedlock.c
> +++ b/src/thread/pthread_mutex_timedlock.c
> @@ -6,13 +6,17 @@
>  static int __futex4(volatile void *addr, int op, int val, const struct timespec *to)
>  {
>  #ifdef SYS_futex_time64
> +	static int use_futex_time64 = 1;
>  	time_t s = to ? to->tv_sec : 0;
>  	long ns = to ? to->tv_nsec : 0;
>  	int r = -ENOSYS;
> -	if (SYS_futex == SYS_futex_time64 || !IS32BIT(s))
> -		r = __syscall(SYS_futex_time64, addr, op, val,
> -			to ? ((long long[]){s, ns}) : 0);
> -	if (SYS_futex == SYS_futex_time64 || r!=-ENOSYS) return r;
> +	if (use_futex_time64) {
> +		if (SYS_futex == SYS_futex_time64 || !IS32BIT(s))
> +			r = __syscall(SYS_futex_time64, addr, op, val,
> +				to ? ((long long[]){s, ns}) : 0);
> +		if (SYS_futex == SYS_futex_time64 || r!=-ENOSYS) return r;
> +		else use_futex_time64 = 0;
> +	}
>  	to = to ? (void *)(long[]){CLAMP(s), ns} : 0;
>  #endif
>  	return __syscall(SYS_futex, addr, op, val, to);

Not needed.

> diff --git a/src/time/clock_gettime.c b/src/time/clock_gettime.c
> index 3e1d0975..736fde8b 100644
> --- a/src/time/clock_gettime.c
> +++ b/src/time/clock_gettime.c
> @@ -73,11 +73,15 @@ int __clock_gettime(clockid_t clk, struct timespec *ts)
>  #endif
>  
>  #ifdef SYS_clock_gettime64
> +	static int use_clock_gettime64 = 1;
>  	r = -ENOSYS;
> -	if (sizeof(time_t) > 4)
> -		r = __syscall(SYS_clock_gettime64, clk, ts);
> -	if (SYS_clock_gettime == SYS_clock_gettime64 || r!=-ENOSYS)
> -		return __syscall_ret(r);
> +	if (use_clock_gettime64) {
> +		if (sizeof(time_t) > 4)
> +			r = __syscall(SYS_clock_gettime64, clk, ts);
> +		if (SYS_clock_gettime == SYS_clock_gettime64 || r!=-ENOSYS)
> +			return __syscall_ret(r);
> +		else use_clock_gettime64 = 0;
> +	}
>  	long ts32[2];
>  	r = __syscall(SYS_clock_gettime, clk, ts32);
>  	if (r==-ENOSYS && clk==CLOCK_REALTIME) {

Needed; it's reading back time.

> diff --git a/src/time/clock_nanosleep.c b/src/time/clock_nanosleep.c
> index e195499c..bbb657b6 100644
> --- a/src/time/clock_nanosleep.c
> +++ b/src/time/clock_nanosleep.c
> @@ -9,14 +9,18 @@ int __clock_nanosleep(clockid_t clk, int flags, const struct timespec *req, stru
>  {
>  	if (clk == CLOCK_THREAD_CPUTIME_ID) return EINVAL;
>  #ifdef SYS_clock_nanosleep_time64
> +	static int use_clock_nanosleep_time64 = 1;
>  	time_t s = req->tv_sec;
>  	long ns = req->tv_nsec;
>  	int r = -ENOSYS;
> -	if (SYS_clock_nanosleep == SYS_clock_nanosleep_time64 || !IS32BIT(s))
> -		r = __syscall_cp(SYS_clock_nanosleep_time64, clk, flags,
> -			((long long[]){s, ns}), rem);
> -	if (SYS_clock_nanosleep == SYS_clock_nanosleep_time64 || r!=-ENOSYS)
> -		return -r;
> +	if (use_clock_nanosleep_time64) {
> +		if (SYS_clock_nanosleep == SYS_clock_nanosleep_time64 || !IS32BIT(s))
> +			r = __syscall_cp(SYS_clock_nanosleep_time64, clk, flags,
> +				((long long[]){s, ns}), rem);
> +		if (SYS_clock_nanosleep == SYS_clock_nanosleep_time64 || r!=-ENOSYS)
> +			return -r;
> +		else use_clock_nanosleep_time64 = 0;
> +	}
>  	long long extra = s - CLAMP(s);
>  	long ts32[2] = { CLAMP(s), ns };
>  	if (clk == CLOCK_REALTIME && !flags)

Not needed.

> diff --git a/src/time/clock_settime.c b/src/time/clock_settime.c
> index 1004ed15..aa137fa7 100644
> --- a/src/time/clock_settime.c
> +++ b/src/time/clock_settime.c
> @@ -7,14 +7,18 @@
>  int clock_settime(clockid_t clk, const struct timespec *ts)
>  {
>  #ifdef SYS_clock_settime64
> +	static int use_clock_settime64 = 1;
>  	time_t s = ts->tv_sec;
>  	long ns = ts->tv_nsec;
>  	int r = -ENOSYS;
> -	if (SYS_clock_settime == SYS_clock_settime64 || !IS32BIT(s))
> -		r = __syscall(SYS_clock_settime64, clk,
> -			((long long[]){s, ns}));
> -	if (SYS_clock_settime == SYS_clock_settime64 || r!=-ENOSYS)
> -		return __syscall_ret(r);
> +	if (use_clock_settime64) {
> +		if (SYS_clock_settime == SYS_clock_settime64 || !IS32BIT(s))
> +			r = __syscall(SYS_clock_settime64, clk,
> +				((long long[]){s, ns}));
> +		if (SYS_clock_settime == SYS_clock_settime64 || r!=-ENOSYS)
> +			return __syscall_ret(r);
> +		else use_clock_settime64 = 0;
> +	}
>  	if (!IS32BIT(s))
>  		return __syscall_ret(-ENOTSUP);
>  	return syscall(SYS_clock_settime, clk, ((long[]){s, ns}));
> -- 
> 2.20.1
> 


Not needed.


* Re: [musl] Backwards kernel compatibility
  2021-06-02 11:52         ` Arnd Bergmann
  2021-06-02 14:56           ` Rich Felker
@ 2021-06-09  7:03           ` Arnd Bergmann
  1 sibling, 0 replies; 13+ messages in thread
From: Arnd Bergmann @ 2021-06-09  7:03 UTC (permalink / raw)
  To: musl; +Cc: Rich Felker, Markus Wichmann, Florian Weimer

On Wed, Jun 2, 2021 at 1:52 PM Arnd Bergmann <arnd@kernel.org> wrote:
> On Wed, Jun 2, 2021 at 9:38 AM Martin Vajnar <martin.vajnar@gmail.com> wrote:

> >
> > The main source of overhead comes from kernel 4.4, which on arm64
> > produces a stack trace whenever an unimplemented syscall is invoked:
> >
> >     https://github.com/torvalds/linux/blob/afd2ff9b7e1b367172f18ba7f693dfb62bdcb2dc/arch/arm64/kernel/traps.c#L369
>
> That is clearly a bug that was fixed in mainline and backported to linux-4.14
> but not 4.4 or 4.9. I've sent a manual backport for inclusion in those kernels
> now.

The backport is now merged into the stable kernel trees and should be part of
4.4.272/4.9.272.

        Arnd


end of thread, other threads:[~2021-06-09  7:05 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-05-10  5:50 [musl] Backwards kernel compatibility Martin Vajnar
2021-05-10  6:46 ` Florian Weimer
2021-05-10 18:58 ` Markus Wichmann
2021-05-24 13:52   ` Martin Vajnar
2021-05-24 22:00     ` Rich Felker
2021-06-02  7:38       ` Martin Vajnar
2021-06-02 11:52         ` Arnd Bergmann
2021-06-02 14:56           ` Rich Felker
2021-06-02 16:01             ` Arnd Bergmann
2021-06-02 16:18               ` Arnd Bergmann
2021-06-09  7:03           ` Arnd Bergmann
2021-06-08 22:16       ` Martin Vajnar
2021-06-09  0:37         ` Rich Felker
