ENOSYS/EOPNOTSUPP fallback?

mailing list of musl libc

* ENOSYS/EOPNOTSUPP fallback?
@ 2017-06-05  3:22 Benjamin Slade
  2017-06-05 12:46 ` Joakim Sindholt
       [not found] ` <b6bc4261.dNq.dMV.B.pUrCBw@mailjet.com>
  0 siblings, 2 replies; 5+ messages in thread
From: Benjamin Slade @ 2017-06-05  3:22 UTC
  To: musl

I ran into what is perhaps a weird edge case. I'm running a musl-based
system with a ZFS root fs. When I tried to install some flatpaks, I got
a `fallocate` failure, with no `dd` fallback. According to the flatpak
team, the fallback to `dd` seems to be something that glibc does (and so
the other components assume it will be taken care of).

Here is the exchange regarding this issue:
https://github.com/flatpak/flatpak/issues/802

Please CC me at slade@jnanam.net if relevant.

--
( Dr Benjamin Slade . b.slade@utah.edu . http://www.jnanam.net/slade )
 ( Linguistics . University of Utah . http://linguistics.utah.edu )
 ( office : LNCO 2309 )
`( pgp_fp: ,(21BA 2AE1 28F6 DF36 110A 0E9C A320 BBE8 2B52 EE19))
'( sent by mu4e on Emacs running under GNU/Linux . https://gnu.org )


* Re: ENOSYS/EOPNOTSUPP fallback?
  2017-06-05  3:22 ENOSYS/EOPNOTSUPP fallback? Benjamin Slade
@ 2017-06-05 12:46 ` Joakim Sindholt
       [not found] ` <b6bc4261.dNq.dMV.B.pUrCBw@mailjet.com>
  1 sibling, 0 replies; 5+ messages in thread
From: Joakim Sindholt @ 2017-06-05 12:46 UTC
  To: musl, slade

On Sun, Jun 04, 2017 at 09:22:27PM -0600, Benjamin Slade wrote:
> I ran into what is perhaps a weird edge case. I'm running a musl-based
> system with a ZFS root fs. When I tried to install some flatpaks, I got
> a `fallocate` failure, with no `dd` fallback. According to the flatpak
> team, the fallback to `dd` seems to be something that glibc does (and so
> the other components assume it will be taken care of).
> 
> Here is the exchange regarding this issue:
> https://github.com/flatpak/flatpak/issues/802

To quote the glibc source file linked in the bug:

  /* Minimize data transfer for network file systems, by issuing
     single-byte write requests spaced by the file system block size.
     (Most local file systems have fallocate support, so this fallback
     code is not used there.)  */

  /* NFS clients do not propagate the block size of the underlying
     storage and may report a much larger value which would still
     leave holes after the loop below, so we cap the increment at
     4096.  */

  /* Write a null byte to every block.  This is racy; we currently
     lack a better option.  Compare-and-swap against a file mapping
     might address local races, but requires interposition of a signal
     handler to catch SIGBUS.  */

Which leaves 2 massive bugs:
1) unallocated gaps can remain, both because of the NFS block size
issue and because other file systems may work on entirely different
principles that are not accounted for here, and
2) data that is concurrently being written to the file can be
overwritten while the range is being forcibly allocated (and the forced
allocation might do nothing at all; think deduplication).

This is not a viable general solution and furthermore fallocate is
mostly just an optimization hint. If it's a hard requirement of your
software I would suggest implementing it in your file system. These
operations can only be safely implemented in the kernel.
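
For reference, the technique the quoted comments describe can be
sketched roughly like this. It is a simplified illustration under my own
assumptions, not glibc's actual code; `emulate_fallocate` and
`touch_byte` are made-up names, and overflow of offset + len is not
checked:

```c
#include <errno.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper, NOT glibc's code: read one byte and write it
   back, in the hope that the file system allocates the block. */
static int touch_byte(int fd, off_t pos)
{
    char c = 0;                       /* stays 0 past EOF or in a hole */
    if (pread(fd, &c, 1, pos) < 0)
        return errno;
    /* Race window: data written at pos by anyone else right here is
       overwritten with the stale byte below. */
    ssize_t n = pwrite(fd, &c, 1, pos);
    if (n != 1)
        return n < 0 ? errno : EIO;
    return 0;
}

/* Returns 0 or an error number, like posix_fallocate. */
int emulate_fallocate(int fd, off_t offset, off_t len)
{
    struct stat st;
    if (offset < 0 || len <= 0)
        return EINVAL;
    if (fstat(fd, &st) != 0)
        return errno;

    /* Cap the step at 4096, as the quoted NFS comment describes. */
    off_t step = st.st_blksize;
    if (step <= 0 || step > 4096)
        step = 4096;

    for (off_t pos = offset; pos < offset + len; pos += step) {
        int err = touch_byte(fd, pos);
        if (err)
            return err;
    }
    /* Touch the last byte too so the file reaches offset + len. */
    return touch_byte(fd, offset + len - 1);
}
```

Nothing in this loop can confirm that a block was really allocated, and
the read/write pair is not atomic, which is where both bugs come from.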

An example:

MyFS uses write-time deduplication on unused blocks (and all-zero blocks
count as unused). Glibc starts its dance, writing a zero byte to the
beginning of each block it perceives; for now let's just say it has the
right block size. MyFS discards these writes immediately without
touching the disk and updates the size metadata, which gets written
lazily at some point. There's only 400k left on the disk, yet your
fallocate of 16G will succeed, and exceptionally fast to boot, but it
will have allocated nothing and your next write fails with ENOSPC.

Another example:

myutil has 2 threads running. One thread is constantly writing things to
a file. The other thread sometimes writes large chunks of data to the
file and so it hints the kernel to allocate these large chunks by
calling fallocate, and only then taking the lock(s) held internally to
synchronize the threads. The first thread finds it needs to update
something in the section currently being fallocated by glibc's
algorithm. Suddenly zero bytes appear at 4k intervals for no discernible
reason, overwriting the data.
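
To make the interleaving concrete, here is a single-threaded replay of
one unlucky ordering. It is only a sketch: `replay_lost_update` is a
hypothetical name, and the "writer thread" step is simulated inline
rather than run in a real second thread:

```c
#include <sys/types.h>
#include <unistd.h>

/* Replays, by hand, one ordering of the allocator and a writer at the
   same offset. Returns the byte finally stored at pos, or -1 on error. */
int replay_lost_update(int fd, off_t pos)
{
    char seen = 0;

    /* Step 1 (allocator): read the current byte; EOF leaves seen at 0. */
    if (pread(fd, &seen, 1, pos) < 0)
        return -1;

    /* Step 2 (writer thread): real data arrives at the same offset. */
    if (pwrite(fd, "X", 1, pos) != 1)
        return -1;

    /* Step 3 (allocator): write back the stale byte, clobbering 'X'. */
    if (pwrite(fd, &seen, 1, pos) != 1)
        return -1;

    char now = 0;
    if (pread(fd, &now, 1, pos) != 1)
        return -1;
    return (unsigned char)now;
}
```

On an empty file this returns 0 rather than 'X': the writer's byte is
gone, and neither side saw an error.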

Personally I would look into making flatpak use fallocate only as an
optimization. Otherwise, the most reliable thing I can think of would be
to do the necessary locking (if any) in the program and fill the entire
target section of the file with data from /dev/urandom, but even that
may fail spectacularly with transparent compression (unlikely as that
is).

Hope this was at least somewhat helpful.


* Re: ENOSYS/EOPNOTSUPP fallback?
       [not found] ` <b6bc4261.dNq.dMV.B.pUrCBw@mailjet.com>
@ 2017-06-11 20:57   ` Benjamin Slade
  2017-06-12 17:55     ` Joakim Sindholt
  0 siblings, 1 reply; 5+ messages in thread
From: Benjamin Slade @ 2017-06-11 20:57 UTC
  To: Joakim Sindholt; +Cc: musl

Thank you for the extensive reply.

Just to be clear: I'm just an end-user of flatpak, &c. As far as I can
tell, flatpak is making use of `ostree` which assumes that the libc will
take care of handling `dd` fallback (I got the impression that flatpak
isn't directly calling `fallocate` itself).

Do you think there's an obvious avenue for following up on this?
Admittedly this is an edge-case that won't necessarily affect musl users
on ext4, but it will affect musl users on zfs (and I believe
f2fs). Do you think `ostree` shouldn't rely on the libc for fallback? Or
should ZFS on Linux implement a fallback for fallocate?

--
Benjamin Slade
  `(pgp_fp: ,(21BA 2AE1 28F6 DF36 110A 0E9C A320 BBE8 2B52 EE19))
    '(sent by mu4e on Emacs running under GNU/Linux . https://gnu.org )
       '(Choose Linux, Choose Freedom . https://linux.com )


* Re: ENOSYS/EOPNOTSUPP fallback?
  2017-06-11 20:57   ` Benjamin Slade
@ 2017-06-12 17:55     ` Joakim Sindholt
  2017-06-12 18:07       ` Rich Felker
  0 siblings, 1 reply; 5+ messages in thread
From: Joakim Sindholt @ 2017-06-12 17:55 UTC
  To: musl, slade

On Sun, Jun 11, 2017 at 02:57:59PM -0600, Benjamin Slade wrote:
> Thank you for the extensive reply.
> 
> Just to be clear: I'm just an end-user of flatpak, &c. As far as I can
> tell, flatpak is making use of `ostree` which assumes that the libc will
> take care of handling `dd` fallback (I got the impression that flatpak
> isn't directly calling `fallocate` itself).

I don't think it's fair to say that they depend on the fallback. POSIX
is very clear that posix_fallocate doesn't fail in the way musl fails
here[1]. They (hopefully) expect it to behave as described in the
standard and there's not much musl can do to alleviate the problem.

> Do you think there's an obvious avenue for following up on this?
> Admittedly this is an edge-case that won't necessarily affect musl users
> on ext4, but it will affect musl users on zfs (and I believe
> f2fs). Do you think `ostree` shouldn't rely on the libc for fallback? Or
> should ZFS on Linux implement a fallback for fallocate?

The reason I recommended using fallocate in a way where failure is
non-fatal is that fixing it properly in the kernel is probably going to
be a pain. After having a look at zfsonlinux[2] it's not at all clear
how much work it would be. Currently they only support calling
fallocate(2) with parameters to deallocate a section of a file, because
that seems to have been the only low-hanging fruit.

Ultimately, ostree isn't doing anything wrong by expecting it to work,
but it might not actually depend on the call succeeding. If so, the easy
fix for your particular case is to make the failure non-fatal, which is
probably a small change.
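
A minimal sketch of what "non-fatal" could look like, assuming the
caller only wants preallocation as a hint. `preallocate_hint` is a
hypothetical helper, not anything taken from ostree or flatpak, and
which errors to tolerate is my assumption:

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical helper: treat preallocation as a hint. Returns 0 when
   the caller may carry on, otherwise the error number. */
int preallocate_hint(int fd, off_t offset, off_t len)
{
    /* posix_fallocate returns the error number; it does not set errno. */
    int err = posix_fallocate(fd, offset, len);

    switch (err) {
    case 0:
        return 0;
    case EOPNOTSUPP:   /* file system cannot preallocate */
    case ENOSYS:       /* kernel lacks the system call */
        fprintf(stderr, "preallocation unavailable (%s), continuing\n",
                strerror(err));
        return 0;
    default:           /* ENOSPC, EBADF, EFBIG, ...: still fatal */
        return err;
    }
}
```

The caller then writes as usual and has to be prepared for ENOSPC from
write() itself, which it needs to handle in any case.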

If you want to fix it properly, the only option I can see is to fix it
in the driver. I don't think any userspace-level hack is going to be
upstreamable in musl, as it would violate the standard in very bad ways
AND appear to work.

Most people here are really big on correctness and I think it would be
really cool to see it fixed in zfs :)

[1] http://pubs.opengroup.org/onlinepubs/9699919799/functions/posix_fallocate.html
[2] https://github.com/zfsonlinux/zfs/commit/cb2d19010d8fbcf6c22585cd8763fad3ba7db724


* Re: ENOSYS/EOPNOTSUPP fallback?
  2017-06-12 17:55     ` Joakim Sindholt
@ 2017-06-12 18:07       ` Rich Felker
  0 siblings, 0 replies; 5+ messages in thread
From: Rich Felker @ 2017-06-12 18:07 UTC
  To: musl

On Mon, Jun 12, 2017 at 07:55:20PM +0200, Joakim Sindholt wrote:
> On Sun, Jun 11, 2017 at 02:57:59PM -0600, Benjamin Slade wrote:
> > Thank you for the extensive reply.
> > 
> > Just to be clear: I'm just an end-user of flatpak, &c. As far as I can
> > tell, flatpak is making use of `ostree` which assumes that the libc will
> > take care of handling `dd` fallback (I got the impression that flatpak
> > isn't directly calling `fallocate` itself).
> 
> I don't think it's fair to say that they depend on the fallback. POSIX
> is very clear that posix_fallocate doesn't fail in the way musl fails
> here[1]. They (hopefully) expect it to behave as described in the
> standard and there's not much musl can do to alleviate the problem.

I don't follow what you mean by "POSIX is very clear...". Any
interface that has defined errors is permitted by POSIX to fail for
other implementation-defined reasons as long as the error codes used
for those reasons don't clash with the standard errors. In any case
there is no way musl can implement posix_fallocate if the underlying
kernel/filesystem does not support it.

I followed up on the flatpak bug tracker thread with some additional
info. But I'm not clear what functionality they actually need from
posix_fallocate because I don't even know what they're doing with it.

Rich

