mailing list of musl libc
* Deadlock when calling fflush/fclose in multiple threads
@ 2018-11-02  5:11 dirk
  0 siblings, 0 replies; 8+ messages in thread
From: dirk @ 2018-11-02  5:11 UTC (permalink / raw)
  To: musl

Hi,

We hit a deadlock on fflush/fclose with musl-1.1.19 (OpenWrt 18.06). We are using Lua's popen from multiple threads; the gdb traces are below.

I am new to the musl libc source code, but as far as I can tell fflush(NULL) calls __ofl_lock and then tries to lock and flush every stream, while fclose locks the stream first and then takes __ofl_lock. So the question is: are the fflush/fclose APIs thread-safe? From the man pages, fflush/fclose on Linux are thread-safe.
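
A minimal standalone sketch of the pattern that triggers it (simplified illustration, not our actual code; the file path and thread count are arbitrary):

    /* Thread A loops fflush(NULL): open-file-list lock, then each FILE's lock.
     * Thread B loops fopen/fclose: the FILE's lock, then the open-file-list
     * lock -- the opposite order, so the two threads can deadlock. */
    #include <stdio.h>
    #include <pthread.h>

    static void *flusher(void *arg)
    {
        for (;;) fflush(NULL);
        return 0;
    }

    static void *opener(void *arg)
    {
        for (;;) {
            FILE *f = fopen("/dev/null", "w");
            if (!f) continue;
            fputc('x', f);          /* give the stream buffered data */
            fclose(f);              /* locks f, then the open file list */
        }
        return 0;
    }

    int main(void)
    {
        pthread_t a, b;
        pthread_create(&a, 0, flusher, 0);
        pthread_create(&b, 0, opener, 0);
        pthread_join(a, 0);         /* never returns once deadlocked */
        return 0;
    }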

(gdb) info thread
  Id   Target Id         Frame 
* 1    LWP 2772 "skynet" __cp_end () at src/thread/i386/syscall_cp.s:31
  2    LWP 2776 "skynet" __cp_end () at src/thread/i386/syscall_cp.s:31
  3    LWP 2777 "skynet" __cp_end () at src/thread/i386/syscall_cp.s:31
  4    LWP 2778 "skynet" 0xb7f95ad9 in __kernel_vsyscall ()
  5    LWP 2779 "skynet" __cp_end () at src/thread/i386/syscall_cp.s:31
  6    LWP 2780 "skynet" 0xb7f95ad9 in __kernel_vsyscall ()
  7    LWP 2781 "skynet" 0xb7f95ad9 in __kernel_vsyscall ()
  8    LWP 2782 "skynet" __cp_end () at src/thread/i386/syscall_cp.s:31
(gdb) thread 6
[Switching to thread 6 (LWP 2780)]
#0  0xb7f95ad9 in __kernel_vsyscall ()
(gdb) bt
#0  0xb7f95ad9 in __kernel_vsyscall ()
#1  0xb7fb1314 in __vsyscall () at src/internal/i386/syscall.s:19
#2  0xb7fe29a9 in __syscall4 (a4=0, a3=2781, a2=128, a1=136262812, n=240) at ./arch/i386/syscall_arch.h:37
#3  __wait (addr=0x81f349c, waiters=0x81f34a0, val=2781, priv=128) at src/thread/__wait.c:13
#4  0xb7fd7a52 in __lockfile (f=0x81f3450) at src/stdio/__lockfile.c:10
#5  0xb7fd829d in fflush (f=0x81f3450) at src/stdio/fflush.c:13
#6  0x0807129e in io_popen (L=0x81e5f04) at liolib.c:282
#7  0x0805a064 in luaD_precall (L=0x81e5f04, func=0x81e638c, nresults=2) at ldo.c:434
#8  0x080661f0 in luaV_execute (L=0x81e5f04) at lvm.c:1146
#9  0x08059dc1 in unroll (L=0x81e5f04, ud=0x0) at ldo.c:556
#10 0x080596bb in luaD_rawrunprotected (L=0x81e5f04, f=0x805a250 <resume>, ud=0xb778a5f8) at ldo.c:142
#11 0x0805a46d in lua_resume (L=0x81e5f04, from=0x81d8bd4, nargs=<optimized out>) at ldo.c:664
#12 0x0806f252 in auxresume (L=L@entry=0x81d8bd4, co=co@entry=0x81e5f04, narg=3) at lcorolib.c:39
#13 0x0806f5a5 in luaB_coresume (L=0x81d8bd4) at lcorolib.c:60
#14 0xb7732ca9 in timing_resume (L=0x81d8bd4) at lualib-src/lua-profile.c:134
#15 0x0805a064 in luaD_precall (L=0x81d8bd4, func=0x81e4144, nresults=-1) at ldo.c:434
#16 0x08066686 in luaV_execute (L=0x81d8bd4) at lvm.c:1162
#17 0x0805a348 in luaD_call (L=0x81d8bd4, func=0x81e4084, nResults=-1) at ldo.c:499
#18 0x0805a3ac in luaD_callnoyield (L=0x81d8bd4, func=0x81e4084, nResults=-1) at ldo.c:509
#19 0x080560e8 in f_call (L=0x81d8bd4, ud=0xb778a8e8) at lapi.c:944
#20 0x080596bb in luaD_rawrunprotected (L=0x81d8bd4, f=0x80560d0 <f_call>, ud=0xb778a8e8) at ldo.c:142
#21 0x0805a656 in luaD_pcall (L=0x81d8bd4, func=0x80560d0 <f_call>, u=0xb778a8e8, old_top=132, ef=0) at ldo.c:729
#22 0x08057969 in lua_pcallk (L=0x81d8bd4, nargs=5, nresults=-1, errfunc=0, ctx=0, k=0x806f070 <finishpcall>) at lapi.c:970
#23 0x0806f116 in luaB_pcall (L=0x81d8bd4) at lbaselib.c:424
#24 0x0805a064 in luaD_precall (L=0x81d8bd4, func=0x81e406c, nresults=2) at ldo.c:434
#25 0x080661f0 in luaV_execute (L=0x81d8bd4) at lvm.c:1146
#26 0x0805a348 in luaD_call (L=0x81d8bd4, func=0x81e4024, nResults=0) at ldo.c:499
#27 0x0805a3ac in luaD_callnoyield (L=0x81d8bd4, func=0x81e4024, nResults=0) at ldo.c:509
#28 0x080560e8 in f_call (L=0x81d8bd4, ud=0xb778abb8) at lapi.c:944
#29 0x080596bb in luaD_rawrunprotected (L=0x81d8bd4, f=0x80560d0 <f_call>, ud=0xb778abb8) at ldo.c:142
#30 0x0805a656 in luaD_pcall (L=0x81d8bd4, func=0x80560d0 <f_call>, u=0xb778abb8, old_top=36, ef=12) at ldo.c:729
#31 0x08057969 in lua_pcallk (L=0x81d8bd4, nargs=5, nresults=0, errfunc=1, ctx=0, k=0x0) at lapi.c:970
#32 0xb772978e in _cb (context=0x81da400, ud=0x81d8bd4, type=1, session=2, source=3, msg=0x81b1080, sz=2) at lualib-src/lua-skynet.c:75
#33 0x0804ecb1 in dispatch_message (ctx=ctx@entry=0x81da400, msg=msg@entry=0xb778ac80) at skynet-src/skynet_server.c:274
#34 0x0804f97a in skynet_context_message_dispatch (sm=0x815b2a0, q=0x81da470, weight=-1) at skynet-src/skynet_server.c:334
#35 0x080501a9 in thread_worker (p=0xbffffb9c) at skynet-src/skynet_start.c:163
#36 0xb7fe3fc5 in start (p=0xb778ad54) at src/thread/pthread_create.c:150
#37 0xb7fe2b3a in __clone () at src/thread/i386/clone.s:36
#38 0x00000000 in ?? ()
(gdb) thread 7
[Switching to thread 7 (LWP 2781)]
#0  0xb7f95ad9 in __kernel_vsyscall ()
(gdb) bt
#0  0xb7f95ad9 in __kernel_vsyscall ()
#1  0xb7fb1314 in __vsyscall () at src/internal/i386/syscall.s:19
#2  0xb7fe278b in __syscall3 (a3=-2147483646, a2=128, a1=-1207963496, n=240) at ./arch/i386/syscall_arch.h:30
#3  __futexwait (priv=128, val=-2147483646, addr=0xb7fff098 <ofl_lock>) at ./src/internal/pthread_impl.h:150
#4  __lock (l=0xb7fff098 <ofl_lock>) at src/thread/__lock.c:42
#5  0xb7fd9f30 in __ofl_lock () at src/stdio/ofl.c:9
#6  0xb7fd8144 in fclose (f=0x81f3450) at src/stdio/fclose.c:17
#7  0xb7fda4e2 in pclose (f=0x81f3450) at src/stdio/pclose.c:9
#8  0x0807130e in io_pclose (L=0x81f3034) at liolib.c:273
#9  0x0807090a in aux_close (L=0x81f3034) at liolib.c:205
#10 0x0805a064 in luaD_precall (L=0x81f3034, func=0x81f312c, nresults=0) at ldo.c:434
#11 0x080661f0 in luaV_execute (L=0x81f3034) at lvm.c:1146
#12 0x080596bb in luaD_rawrunprotected (L=0x81f3034, f=0x805a250 <resume>, ud=0xb77748c8) at ldo.c:142
#13 0x0805a46d in lua_resume (L=0x81f3034, from=0x81e6654, nargs=<optimized out>) at ldo.c:664
#14 0x0806f252 in auxresume (L=L@entry=0x81e6654, co=co@entry=0x81f3034, narg=0) at lcorolib.c:39
#15 0x0806f5a5 in luaB_coresume (L=0x81e6654) at lcorolib.c:60
#16 0xb7732ca9 in timing_resume (L=0x81e6654) at lualib-src/lua-profile.c:134
#17 0x0805a064 in luaD_precall (L=0x81e6654, func=0x81f09e4, nresults=-1) at ldo.c:434
#18 0x08066686 in luaV_execute (L=0x81e6654) at lvm.c:1162
#19 0x0805a348 in luaD_call (L=0x81e6654, func=0x81f0924, nResults=0) at ldo.c:499
#20 0x0805a3ac in luaD_callnoyield (L=0x81e6654, func=0x81f0924, nResults=0) at ldo.c:509
#21 0x080560e8 in f_call (L=0x81e6654, ud=0xb7774bb8) at lapi.c:944
#22 0x080596bb in luaD_rawrunprotected (L=0x81e6654, f=0x80560d0 <f_call>, ud=0xb7774bb8) at ldo.c:142
#23 0x0805a656 in luaD_pcall (L=0x81e6654, func=0x80560d0 <f_call>, u=0xb7774bb8, old_top=36, ef=12) at ldo.c:729
#24 0x08057969 in lua_pcallk (L=0x81e6654, nargs=5, nresults=0, errfunc=1, ctx=0, k=0x0) at lapi.c:970
#25 0xb772978e in _cb (context=0x81e6e00, ud=0x81e6654, type=1, session=1, source=0, msg=0x0, sz=0) at lualib-src/lua-skynet.c:75
#26 0x0804ecb1 in dispatch_message (ctx=ctx@entry=0x81e6e00, msg=msg@entry=0xb7774c80) at skynet-src/skynet_server.c:274
#27 0x0804f97a in skynet_context_message_dispatch (sm=0x815b2c0, q=0x81e6ec0, weight=-1) at skynet-src/skynet_server.c:334
#28 0x080501a9 in thread_worker (p=0xbffffba8) at skynet-src/skynet_start.c:163
#29 0xb7fe3fc5 in start (p=0xb7774d54) at src/thread/pthread_create.c:150
#30 0xb7fe2b3a in __clone () at src/thread/i386/clone.s:36
#31 0x00000000 in ?? ()
(gdb) 



dirk@kooiot.com


* Re: Deadlock when calling fflush/fclose in multiple threads
@ 2018-11-02 15:59 dirk
  0 siblings, 0 replies; 8+ messages in thread
From: dirk @ 2018-11-02 15:59 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl

In our case, fflush(NULL) is called inside Lua's popen. We run free/df/loadavg through popen from multiple threads, once every five seconds, to collect device status. To reproduce the bug I call popen in an infinite loop in two threads; the deadlock then happens within one second in our application.

The deadlock happens on x86-64 OpenWrt (virtual machine); we have not seen it on NXP i.MX6UL (ARM Cortex-A9) or TP-Link 702N (MIPS 24Kc). I first thought it might be caused by 64-bit builds, so we used glibc for those amd64 VMs. Last month we started porting our application to the Raspberry Pi 3B+ and hit the same deadlock (the executed commands become zombies), which gave me the chance to look at this issue again.
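
The reproduction is essentially just this (simplified sketch; the command and buffer size are arbitrary, and Lua's io.popen is what issues the fflush(NULL) for us):

    /* Two threads looping on popen/pclose. Lua's io.popen calls fflush(NULL)
     * before popen(), so one thread can be inside fflush(NULL) holding the
     * open-file-list lock while the other is inside pclose() -> fclose()
     * holding that stream's lock, each waiting for the other's lock. */
    #include <stdio.h>
    #include <pthread.h>

    static void *worker(void *arg)
    {
        char buf[256];
        for (;;) {
            fflush(NULL);                 /* what Lua's popen does first */
            FILE *p = popen("free", "r"); /* any short command will do */
            if (!p) continue;
            while (fgets(buf, sizeof buf, p))
                ;
            pclose(p);                    /* fclose path: stream lock, then list lock */
        }
        return 0;
    }

    int main(void)
    {
        pthread_t t[2];
        for (int i = 0; i < 2; i++) pthread_create(&t[i], 0, worker, 0);
        for (int i = 0; i < 2; i++) pthread_join(t[i], 0);
        return 0;
    }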



Sent from my Meizu PRO 6s

-------- Original message --------
From: Rich Felker <dalias@libc.org>
Date: 2018-11-02 22:29
To: dirk@kooiot.com
Cc: musl <musl@lists.openwall.com>
Subject: Re: [musl] Deadlock when calling fflush/fclose in multiple threads

>On Fri, Nov 02, 2018 at 01:11:00PM +0800, dirk@kooiot.com wrote:
>> Hi,
>> 
>> We hit a deadlock on fflush/fclose with musl-1.1.19 (OpenWrt 18.06).
>> We are using Lua's popen from multiple threads; the gdb traces are
>> below.
>> 
>> I am new to the musl libc source code, but as far as I can tell
>> fflush(NULL) calls __ofl_lock and then tries to lock and flush every
>> stream, while fclose locks the stream first and then takes __ofl_lock.
>> So the question is: are the fflush/fclose APIs thread-safe? From the
>> man pages, fflush/fclose on Linux are thread-safe.
>
>Your analysis is exactly correct. Calling fflush(NULL) frequently (or
>at all) is a really bad idea because of how it scales and how
>serializing it is, but it is valid, and the deadlock is a bug in musl.
>
>The current placement of the ofl update seems to have been based on
>minimizing how serializing fclose is, and on avoiding taking the
>global lock for F_PERM (stdin/out/err) FILEs (which is largely a
>useless optimization since the operation can happen at most 3 times).
>Just moving it above the FLOCK (and making it not conditional on
>F_PERM, to avoid data races) would solve this, but there's a deeper
>bug here too.
>
>By removing the FILE being closed from the open file list (and
>unlocking the open file list, without which the removal can't be seen)
>before it's flushed and closed, fclose creates a race window where
>fflush(NULL) or exit() from another thread can complete without this
>file being flushed, potentially causing data loss.
>
>I think we just have to move the __ofl_lock to the top of the
>function, before FLOCK, and the __ofl_unlock to after the
>fflush/close. Unfortunately this makes fclose much more serializing
>than it was before, but I don't see any way to avoid it.
>
>Rich
>


* Re: Deadlock when calling fflush/fclose in multiple threads
  2018-11-02 15:37       ` John Starks
@ 2018-11-02 15:46         ` Rich Felker
  0 siblings, 0 replies; 8+ messages in thread
From: Rich Felker @ 2018-11-02 15:46 UTC (permalink / raw)
  To: musl

On Fri, Nov 02, 2018 at 03:37:18PM +0000, John Starks wrote:
> > 
> > I think such an approach is plausible, but involves the kind of
> > complex and error-prone direct use of atomics I'm actively trying to
> > eliminate. The same could be done without low level hacks via clever
> > use of rwlocks or a mutex+condvar pair, but all of these involve
> > namespace-safety issues and a lot more code than should be introduced
> > into minimal static programs using stdio.
> > 
> > For what it's worth, the only consumers of the open file list that can
> > be executed more than once are fflush(NULL), fclose, and __ofl_add
> > (used by fopen, etc.).
> > 
> > Rich
> 
> Yeah, I can see the desire to avoid additional complexity. It would
> be a shame, though, for fopen or fclose of a local file to block
> behind flushing during an fclose of a file on a network file system,
> for example.

Indeed. Actually it seems to be a show-stopping bug, but I have a
better solution. See the other branch of this thread.

Rich



* Re: Deadlock when calling fflush/fclose in multiple threads
  2018-11-02 14:29 ` Rich Felker
  2018-11-02 15:15   ` John Starks
@ 2018-11-02 15:45   ` Rich Felker
  1 sibling, 0 replies; 8+ messages in thread
From: Rich Felker @ 2018-11-02 15:45 UTC (permalink / raw)
  To: dirk; +Cc: musl

On Fri, Nov 02, 2018 at 10:29:15AM -0400, Rich Felker wrote:
> By removing the FILE being closed from the open file list (and
> unlocking the open file list, without which the removal can't be seen)
> before it's flushed and closed, fclose creates a race window where
> fflush(NULL) or exit() from another thread can complete without this
> file being flushed, potentially causing data loss.
> 
> I think we just have to move the __ofl_lock to the top of the
> function, before FLOCK, and the __ofl_unlock to after the
> fflush/close. Unfortunately this makes fclose much more serializing
> than it was before, but I don't see any way to avoid it.

Another possibility seems to be moving the ofl lock and unlock to after
the fflush, close, and FUNLOCK, but before the free. This leaves a 'dead'
FILE in the open file list momentarily, but the only things that can
act on it are pthread_create's init_file_lock, __stdio_exit's
close_file, and fflush(NULL), and none of these can have any side
effects except on a FILE with buffered data (which the FILE being
closed can't have at this point).

I think I like this solution better, and I think it's necessary to do
something other than the above-quoted idea; holding the ofl lock
during flush can itself cause deadlock, since the flush could block
and forward progress of whatever has it blocked (e.g. other end of
socket or pipe) could depend on forward progress of fclose, fopen,
etc. in another thread. Also, in light of having added support for
application-provided buffers with setvbuf, even on regular files the
flush operation could take a long time.
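
In outline, the fclose tail would then look something like this (a paraphrase of the ordering described above, using musl's internal helpers; not the literal source, and the locked-file-list bookkeeping is omitted):

    /* Flush and close under the FILE lock only; take the open-file-list lock
     * just long enough to unlink the now-dead FILE; free it afterwards. */
    int fclose(FILE *f)
    {
        int r;

        FLOCK(f);
        r = fflush(f);
        r |= f->close(f);
        FUNLOCK(f);

        if (f->flags & F_PERM) return r;   /* stdin/out/err stay listed */

        FILE **head = __ofl_lock();        /* only list surgery under this lock */
        if (f->prev) f->prev->next = f->next;
        if (f->next) f->next->prev = f->prev;
        if (*head == f) *head = f->next;
        __ofl_unlock();

        free(f->getln_buf);
        free(f);
        return r;
    }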

Rich



* RE: Deadlock when calling fflush/fclose in multiple threads
  2018-11-02 15:33     ` Rich Felker
@ 2018-11-02 15:37       ` John Starks
  2018-11-02 15:46         ` Rich Felker
  0 siblings, 1 reply; 8+ messages in thread
From: John Starks @ 2018-11-02 15:37 UTC (permalink / raw)
  To: musl

> 
> I think such an approach is plausible, but involves the kind of
> complex and error-prone direct use of atomics I'm actively trying to
> eliminate. The same could be done without low level hacks via clever
> use of rwlocks or a mutex+condvar pair, but all of these involve
> namespace-safety issues and a lot more code than should be introduced
> into minimal static programs using stdio.
> 
> For what it's worth, the only consumers of the open file list that can
> be executed more than once are fflush(NULL), fclose, and __ofl_add
> (used by fopen, etc.).
> 
> Rich

Yeah, I can see the desire to avoid additional complexity. It would be a shame, though, for fopen or fclose of a local file to block behind flushing during an fclose of a file on a network file system, for example.



* Re: Deadlock when calling fflush/fclose in multiple threads
  2018-11-02 15:15   ` John Starks
@ 2018-11-02 15:33     ` Rich Felker
  2018-11-02 15:37       ` John Starks
  0 siblings, 1 reply; 8+ messages in thread
From: Rich Felker @ 2018-11-02 15:33 UTC (permalink / raw)
  To: musl

On Fri, Nov 02, 2018 at 03:15:51PM +0000, John Starks wrote:
> > 
> > I think we just have to move the __ofl_lock to the top of the
> > function, before FLOCK, and the __ofl_unlock to after the
> > fflush/close. Unfortunately this makes fclose much more serializing
> > than it was before, but I don't see any way to avoid it.
> > 
> 
> Perhaps you could keep a global count of FILEs that are still being
> flushed after having been removed from the list. fflush could
> perform a futex wait on this becoming 0.

I think such an approach is plausible, but involves the kind of
complex and error-prone direct use of atomics I'm actively trying to
eliminate. The same could be done without low level hacks via clever
use of rwlocks or a mutex+condvar pair, but all of these involve
namespace-safety issues and a lot more code than should be introduced
into minimal static programs using stdio.
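
For illustration, the mutex+condvar variant of that idea would look roughly like this (hypothetical helper names, not musl code); it is exactly the sort of extra machinery described above:

    /* Sketch: count FILEs unlinked from the open file list but still being
     * flushed/closed; fflush(NULL) and exit() wait for the count to hit zero. */
    #include <pthread.h>

    static pthread_mutex_t closing_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t closing_cv = PTHREAD_COND_INITIALIZER;
    static int closing_count;

    static void closing_begin(void)   /* fclose, right after unlinking the FILE */
    {
        pthread_mutex_lock(&closing_lock);
        closing_count++;
        pthread_mutex_unlock(&closing_lock);
    }

    static void closing_end(void)     /* fclose, after the flush and close */
    {
        pthread_mutex_lock(&closing_lock);
        if (!--closing_count) pthread_cond_broadcast(&closing_cv);
        pthread_mutex_unlock(&closing_lock);
    }

    static void closing_wait(void)    /* fflush(NULL)/exit, after walking the list */
    {
        pthread_mutex_lock(&closing_lock);
        while (closing_count) pthread_cond_wait(&closing_cv, &closing_lock);
        pthread_mutex_unlock(&closing_lock);
    }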

For what it's worth, the only consumers of the open file list that can
be executed more than once are fflush(NULL), fclose, and __ofl_add
(used by fopen, etc.).

Rich



* RE: Deadlock when calling fflush/fclose in multiple threads
  2018-11-02 14:29 ` Rich Felker
@ 2018-11-02 15:15   ` John Starks
  2018-11-02 15:33     ` Rich Felker
  2018-11-02 15:45   ` Rich Felker
  1 sibling, 1 reply; 8+ messages in thread
From: John Starks @ 2018-11-02 15:15 UTC (permalink / raw)
  To: musl, dirk

> 
> I think we just have to move the __ofl_lock to the top of the
> function, before FLOCK, and the __ofl_unlock to after the
> fflush/close. Unfortunately this makes fclose much more serializing
> than it was before, but I don't see any way to avoid it.
> 

Perhaps you could keep a global count of FILEs that are still being flushed after having been removed from the list. fflush could perform a futex wait on this becoming 0.



* Re: Deadlock when calling fflush/fclose in multiple threads
       [not found] <2018110213110009300613@kooiot.com>
@ 2018-11-02 14:29 ` Rich Felker
  2018-11-02 15:15   ` John Starks
  2018-11-02 15:45   ` Rich Felker
  0 siblings, 2 replies; 8+ messages in thread
From: Rich Felker @ 2018-11-02 14:29 UTC (permalink / raw)
  To: dirk; +Cc: musl

On Fri, Nov 02, 2018 at 01:11:00PM +0800, dirk@kooiot.com wrote:
> Hi,
> 
> We hit a deadlock on fflush/fclose with musl-1.1.19 (OpenWrt 18.06).
> We are using Lua's popen from multiple threads; the gdb traces are
> below.
> 
> I am new to the musl libc source code, but as far as I can tell
> fflush(NULL) calls __ofl_lock and then tries to lock and flush every
> stream, while fclose locks the stream first and then takes __ofl_lock.
> So the question is: are the fflush/fclose APIs thread-safe? From the
> man pages, fflush/fclose on Linux are thread-safe.

Your analysis is exactly correct. Calling fflush(NULL) frequently (or
at all) is a really bad idea because of how it scales and how
serializing it is, but it is valid, and the deadlock is a bug in musl.

The current placement of the ofl update seems to have been based on
minimizing how serializing fclose is, and on avoiding taking the
global lock for F_PERM (stdin/out/err) FILEs (which is largely a
useless optimization since the operation can happen at most 3 times).
Just moving it above the FLOCK (and making it not conditional on
F_PERM, to avoid data races) would solve this, but there's a deeper
bug here too.

By removing the FILE being closed from the open file list (and
unlocking the open file list, without which the removal can't be seen)
before it's flushed and closed, fclose creates a race window where
fflush(NULL) or exit() from another thread can complete without this
file being flushed, potentially causing data loss.

I think we just have to move the __ofl_lock to the top of the
function, before FLOCK, and the __ofl_unlock to after the
fflush/close. Unfortunately this makes fclose much more serializing
than it was before, but I don't see any way to avoid it.
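
In pseudo-outline (paraphrased; F_PERM and FUNLOCK details omitted, not the literal source), the current ordering versus the proposed one:

    current fclose (1.1.19):              proposed here:
        FLOCK(f);                             __ofl_lock();
        __ofl_lock();                         FLOCK(f);
        unlink f from open file list;         unlink f from open file list;
        __ofl_unlock();                       fflush(f); f->close(f);
        fflush(f); f->close(f);               __ofl_unlock();
        free(f);                              free(f);

On the left, fclose holds f's lock while waiting for the list lock, while fflush(NULL) holds the list lock while waiting for f's lock; on the right, both paths take the list lock first.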

Rich



Thread overview: 8+ messages
2018-11-02  5:11 Deadlock when calling fflush/fclose in multiple threads dirk
     [not found] <2018110213110009300613@kooiot.com>
2018-11-02 14:29 ` Rich Felker
2018-11-02 15:15   ` John Starks
2018-11-02 15:33     ` Rich Felker
2018-11-02 15:37       ` John Starks
2018-11-02 15:46         ` Rich Felker
2018-11-02 15:45   ` Rich Felker
2018-11-02 15:59 dirk

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/
